Chapter 1

Introduction 


Functional languages support a powerful programming methodology based on higher-order functions
and infinite objects. Furthermore, they admit diverse implementation strategies. While
lazy  functional  languages  have  more  expressive  power  than  eager  languages,  eager  functional 
languages  are  more  amenable  to  parallel  execution.  The  goal  of  this  thesis  is  to  extend  the  power 
of  the  dataflow  language  Id  [37],  a  non-strict  eager  language,  and  its  implementations  [7,  38]  by 
introducing  a  limited  form  of  lazy  evaluation;  lazy  data-structures  allow  the  evaluation  of  the 
contents  of  a  data  structure  slot  to  be  postponed  (perhaps  forever)  until  a  consumer  reads  the 
slot,  thus  extending  the  expressive  power  of  Id. 

Several  terms  require  clarification.  Section  1.1  discusses  what  we  mean  by  the  terms  “strict”, 
“non-strict”,  “lazy”,  and  “eager”  as  ways  of  describing  functional  languages  and  interpreters. 
Section 1.2.1 presents two interesting programming paradigms that are facilitated by lazy
data-structures, and Section 1.2.2 discusses the costs of lazy evaluation.


1.1  Terminology 

Most  programming  languages  are  strict  (i.e.,  have  strict  semantics).  This  means  that  the 
arguments to a procedure are evaluated before the procedure is called. Consider the following
expression  that  constructs  a  pair  using  the  cons  procedure  and  selects  the  first  element  of  the 
pair  using  the  head  procedure: 

head (cons exp1 exp2)

Since the argument expressions exp1 and exp2 are evaluated before calling cons, exp2 is
evaluated even though the result of the evaluation is not needed to produce the overall result.
If the evaluation of either exp1 or exp2 diverges (goes into an infinite loop), evaluation of the
overall  expression  diverges  too,  and  the  overall  expression  never  produces  a  result.  Termination 
and  producing  a  result  for  the  entire  expression  depend  on  the  termination  of  exp2,  even  though 
the  result  is  not  needed. 

One exception to the “arguments first” rule is allowed. The consequent and alternate of
a conditional expression, which can be syntactically identified, are not evaluated until
the predicate is resolved; then only one of them is evaluated. This is necessary to permit recursive
definitions.  It  is  not  possible  for  users  to  define  conditionals  with  procedures,  however,  as  we 
discuss  in  Section  1.2.4. 

Lazy  functional  languages  are  a  subset  of  non-strict  functional  languages.  Computation 
is  never  performed  unless  the  result  is  required  to  produce  the  overall  result.  In  the  example 
above,  in  a  lazy  functional  language,  the  overall  result  does  not  depend  on  exp2,  and  exp2  would 
not  be  evaluated.  This  is  accomplished  by  calling  procedures  before  evaluating  arguments:  the 
evaluation  of  an  argument  is  delayed  until  its  value  is  found  to  be  required  to  produce  the 
result. A technique for delaying the evaluation of an expression is described in Section 1.2.2.
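The contrast between the strict and lazy treatments of head (cons exp1 exp2) can be sketched in Python, which is itself strict. This is only an illustration under stated assumptions: delayed arguments are modeled as zero-argument functions, and a diverging exp2 is modeled by raising an exception so that the demonstration stays finite. The names diverge, lazy_cons, and lazy_head are hypothetical and do not come from Id or from the thesis.

```python
def diverge():
    # Stand-in for a non-terminating exp2; raising keeps the demo finite.
    raise RuntimeError("exp2 was evaluated")


def cons(a, b):
    return (a, b)


def head(pair):
    return pair[0]


def lazy_cons(a_thunk, b_thunk):
    # Lazy variant: both components arrive as zero-argument callables.
    return (a_thunk, b_thunk)


def lazy_head(pair):
    # Force only the first component; the second is never touched.
    return pair[0]()


def strict_demo():
    # Python evaluates both arguments before calling cons, as a strict language does.
    try:
        return head(cons(1 + 2, diverge()))
    except RuntimeError as err:
        return str(err)


def lazy_demo():
    # exp2 is wrapped in a lambda, so it is delayed and never evaluated.
    return lazy_head(lazy_cons(lambda: 1 + 2, lambda: diverge()))
```

Calling strict_demo() reports that exp2 was evaluated, while lazy_demo() returns the head without ever touching exp2.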

Lazy  functional  languages  have  an  inherently  sequential  aspect,  too:  procedures  are  called 
before  arguments  are  evaluated.  The  compiler  may  compile  code  to  evaluate  an  argument  before 
or  in  parallel  with  calling  a  procedure,  for  example,  as  long  as  it  can  be  sure  that  the  semantics 
of  the  program  are  the  same.  Strictness  analysis,  a  technique  for  accomplishing  this  type  of 
optimization, is discussed in Section 1.2.3.

Since an expression is evaluated only if its result is required to produce the overall result,
an unbounded amount of computation is never performed once the answer has been computed.
Said another way, producing a result and termination are inextricably tied. If a computation
does not terminate, no result will be produced.

Lazy functional languages do not cover all non-strict functional languages. Id¹ [37] is a
non-strict non-lazy language. Producing the result of an Id program does not imply termination.
In the above example, in Id, production of a result depends on termination of the evaluation of
exp1, and overall termination depends on the termination of both exp1 and exp2. “Non-lazy”
is called “eager”, the complement of “lazy”. An expression might be evaluated even if its result
is not a prerequisite for the overall result.

¹ Id is not a functional language, but a large subset of Id is functional. When the non-functionality is an issue,
it will be pointed out.

Just as non-strict functional languages subsume lazy functional languages, eager functional
languages subsume strict functional languages, as seen in Figure 1.1.


[Figure 1.1 diagram not reproduced; recoverable labels: strict, non-strict, eager, lazy, and a region marked “eager / non-strict”.]

Figure 1.1: The Functional Language Spectrum


So  far,  we  have  talked  about  attributes  of  languages.  These  qualifiers  can  also  be  applied  to 
interpreters,  machines  on  which  programs  can  be  run.  An  eager  interpreter  naturally  supports 
the  interpretation  of  an  eager  language,  etc.  That  is  not  to  say  that  a  lazy  functional  language 
cannot  be  implemented  on  an  eager  interpreter,  as  we  will  see  in  several  examples. 

In  the  rest  of  this  chapter,  we  discuss  the  motivation  for  our  lazy/eager  mixture,  and  we 
discuss  other  dataflow  approaches  for  achieving  laziness. 


1.2  Lazy  Evaluation  Reconsidered 

In  this  section,  we  consider  various  aspects  of  lazy  evaluation.  We  begin  with  some  programming 
paradigms  available  in  lazy  functional  languages  but  not  eager  functional  languages.  Then,  we 
consider  the  cost  of  lazy  evaluation  and  how  that  cost  is  avoided  (to  some  extent)  in  lazy 
functional  languages.  Finally,  we  discuss  why  we  chose  an  eager  language  rather  than  a  lazy 
language  as  a  starting  point,  and  we  present  our  approach. 

1.2.1 Two Interesting Programming Paradigms

Lazy  functional  languages  facilitate  programming  paradigms  not  available  in  eager  languages. 
Two  such  paradigms  are  considered  in  this  section:  programming  with  conceptually  infinite 
data-structures and programming with data structures with slots that are expensive to compute.

Streams  are  classic  examples  of  infinite  data  structures.  These  conceptually  infinite  lists 
expand  only  as  far  as  their  consumers  require.  As  long  as  consumers  only  traverse  a  finite 
prefix  of  a  stream,  only  a  finite  prefix  is  computed.  More  generally,  very  large  or  infinite 
data structures can be traversed while newly explored sections are generated automatically as
needed. 
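A rough Python analogue of a stream is a generator, which produces elements only as a consumer asks for them. The sketch below is illustrative only: the names integers_from, squares, and take are hypothetical, and a generator differs from a true lazy list in that it does not retain the prefix it has already produced.

```python
from itertools import islice


def integers_from(n):
    """Conceptually infinite stream n, n+1, n+2, ... produced on demand."""
    while True:
        yield n
        n += 1


def squares(stream):
    """Map over a stream without forcing more of it than the consumer reads."""
    for x in stream:
        yield x * x


def take(k, stream):
    """Traverse only a finite prefix of the stream."""
    return list(islice(stream, k))
```

For example, take(5, squares(integers_from(1))) computes only the first five squares, even though both streams are conceptually infinite.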

Sometimes  particular  data  structure  positions  are  expensive  to  compute,  but  we  might 
ignore  the  value  of  the  position.  Consider,  for  example,  a  memoization  table  for  a  complex 
function.  For  every  position  that  we  avoid  computing,  significant  savings  are  realized.  We  call 
this  paradigm  “programming  with  expensive  slots”. 
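“Programming with expensive slots” can be sketched as a table whose entries are filled in only when a consumer reads them. This is a minimal Python illustration, not the mechanism developed in the thesis; the class LazyTable and the function slow_square are hypothetical names, and the evaluations counter exists only to make the savings visible.

```python
class LazyTable:
    """A fixed-size table whose slots are computed on first read."""

    def __init__(self, size, compute):
        self._compute = compute          # function from index to slot value
        self._filled = [False] * size    # which slots have been evaluated
        self._slots = [None] * size      # storage for evaluated slots
        self.evaluations = 0             # number of slots actually computed

    def __getitem__(self, i):
        if not self._filled[i]:
            self._slots[i] = self._compute(i)
            self._filled[i] = True
            self.evaluations += 1
        return self._slots[i]


def slow_square(i):
    # Placeholder for a complex function that is expensive to compute.
    return i * i
```

A table of 1000 slots that is read at only two distinct positions performs two evaluations; the remaining 998 slots are never computed.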

Both  of  these  paradigms,  which  are  closely  tied  to  data-structures,  are  made  available  by 
the  system  developed  in  this  thesis. 

1.2.2  The  Cost  of  Lazy  Evaluation 

When an expression is evaluated, the interpreter (machine) has some state associated with it.
In  particular,  the  free  variables  of  the  expression  are  available.  In  order  for  the  evaluation  of 
the  expression  to  be  delayed,  provision  must  be  made  to  make  the  values  of  the  free  variables 
available  when  the  expression  is  eventually  evaluated.  One  technique  for  packaging  a  delayed 
expression with its environment is called thunks². A thunk is a piece of code to evaluate the
delayed  expression  after  “restoring  the  expression’s  environment”. 

Many  efficiency  issues  arise  when  thunks  are  used  to  implement  delayed  computation.  First 
we  consider  the  general  issues  of  sharing  the  result  of  a  delayed  expression  and  of  having  more 
thunks  than  necessary  to  achieve  desired  program  behavior. 

If  two  computations  have  independent  copies  of  a  thunk  and  both  computations  evaluate 
the  thunk,  the  computation  will  be  performed  twice.  Sharing  computation  is  so  important  that 
any  realistic  approach  must  incorporate  it.  Any  implementation  which  does  so  must  provide 
storage  for  the  result  of  each  thunk.  The  need  for  a  shared  location  motivated  our  restriction 
(to  be  presented)  that  delayed  expressions  sit  in  data  structures. 

Coordination  of  a  shared  resource  requires  synchronization;  in  a  sequential  machine,  where 
only  one  process  is  active  at  a  time,  the  synchronization  and  coordination  that  surrounds 
a shared thunk are straightforward. In a parallel machine, however, managing any shared
resource  is  complex.  The  embodiment  of  the  required  synchronization  mechanism  will  emerge 
as  the  basis  for  our  approach. 
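To illustrate why a shared thunk needs synchronization on a parallel machine, the following Python sketch guards the thunk with a lock so that concurrent consumers evaluate it at most once. This is an assumption-laden software illustration only: the thesis relies on its own synchronization mechanism, which is not shown here, and the names SharedThunk and force_from_many_threads are hypothetical.

```python
import threading


class SharedThunk:
    """A delayed computation that several threads may force concurrently."""

    def __init__(self, compute):
        self._compute = compute
        self._lock = threading.Lock()
        self._done = False
        self._value = None
        self.evaluations = 0   # how many times the computation actually ran

    def force(self):
        # The lock serializes the flag check and the evaluation, so exactly
        # one consumer performs the computation and the rest reuse its result.
        with self._lock:
            if not self._done:
                self._value = self._compute()
                self._done = True
                self.evaluations += 1
        return self._value


def force_from_many_threads(thunk, n_threads=8):
    """Force the same thunk from several threads and collect what each saw."""
    results = []
    results_lock = threading.Lock()

    def worker():
        value = thunk.force()
        with results_lock:
            results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every consumer pays for the lock on every reference in this sketch, which is one reason a software-only scheme is costly.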

Most expressions are always computed and need not be delayed. Strictness analysis and
similar techniques optimize the implementation of lazy functional languages by eliminating as
many superfluous thunks as can be identified. Since the starting point is far from the target,
these techniques must be excellent to provide reasonable efficiency. We shoot for the target
from close range. Annotations point to the few lazy exceptions.

² The term thunk comes from Algol 60, where it was the mechanism used to pass call-by-name parameters.
Although the value was not computed yet, the compiler had already “thunk” how it would be done when the
time came [18]. Each time the variable was needed, the value was computed.

The  cost  of  thunks  can  be  broken  up  into  three  categories:  the  cost  of  building  thunks,  the 
cost  of  the  first  reference,  and  the  cost  of  successive  references. 

When an expression is delayed, the environment must be saved. In a system where environments
are manipulable (such as Scheme [41]), the thunk for the delayed expression can simply
record  a  pointer  to  the  delayed  expression’s  lexically  enclosing  environment.  This  environment, 
however,  is  likely  to  contain  much  more  information  than  we  need  to  evaluate  the  expression. 
Since  the  shared  environment  already  exists,  using  it  makes  the  construction  of  the  thunk  fast, 
but  the  lifetime  of  the  entire  environment  is  now  tied  to  the  thunk.  The  environment  (which 
is  actually  a  sequence  of  frames  in  Scheme)  cannot  be  reclaimed  until  the  delayed  expression 
is  evaluated  or  discarded.  This  is  known  as  dragging:  the  thunk  is  dragging  the  environment. 
Another  alternative  is  to  copy  the  subset  of  the  environment  corresponding  to  the  free  variables 
of  the  delayed  expression  into  an  independent  environment.  Building  a  new  environment  takes 
more  time,  but  dragging  is  avoided. 
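The two ways of building a thunk can be sketched in Python by modeling an environment as a dictionary. This is a simplified illustration under that assumption (Scheme environments are chains of frames, not flat dictionaries), and the function names are hypothetical.

```python
def make_thunk_shared(expr, env):
    # Fast to build: keep a pointer to the whole enclosing environment.
    # The entire environment stays reachable as long as the thunk does (dragging).
    return (expr, env)


def make_thunk_copied(expr, free_vars, env):
    # Slower to build: copy only the free variables of the delayed expression
    # into an independent environment, so the original can be reclaimed.
    return (expr, {name: env[name] for name in free_vars})


def force(thunk):
    expr, env = thunk
    return expr(env)


# An enclosing environment holding far more than the expression needs.
enclosing = {"x": 2, "y": 3, "big": list(range(100000))}


def delayed_sum(env):
    # The delayed expression x + y, with free variables x and y.
    return env["x"] + env["y"]


shared = make_thunk_shared(delayed_sum, enclosing)
copied = make_thunk_copied(delayed_sum, ["x", "y"], enclosing)
```

Both thunks yield the same value when forced, but only the shared one keeps the large "big" entry alive.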

The first time the value of a delayed expression is requested, the expression must be evaluated.
Assuming that we do not wish to recompute the expression, we must keep track of both
the value of the expression and the fact that the expression has been evaluated. All references,
both initial and successive, must check a flag to determine if the delayed expression has been
evaluated. We would like to minimize the cost of this recurring expense.
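The flag check on every reference can be made concrete with a small Python sketch. The class name Delayed and its counters are hypothetical and exist only to expose the costs discussed above: one evaluation on the first reference, and one flag check on every reference.

```python
class Delayed:
    """A delayed expression with an 'evaluated' flag and a cached value."""

    def __init__(self, compute):
        self.evaluated = False
        self.payload = compute    # the delayed expression, later its value
        self.flag_checks = 0      # paid on every reference
        self.evaluations = 0      # paid on the first reference only

    def force(self):
        self.flag_checks += 1
        if not self.evaluated:
            self.payload = self.payload()   # overwrite the expression with its value
            self.evaluated = True
            self.evaluations += 1
        return self.payload
```

Forcing the same Delayed object three times runs the computation once but checks the flag three times.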

1.2.3  Decreasing  the  Cost  of  Lazy  Evaluation  in  Lazy  Functional  Languages 

The  lazy  functional  languages  community  has  developed  techniques  to  abate  the  cost  of  lazy 
evaluation.  Strictness  analysis  and  strictness  annotations  allow  some  expressions  to  be  evaluated 
eagerly.  Path  analysis  [48]  uses  several  different  kinds  of  thunks,  allowing  various  special  cases 
to  be  optimized. 

Strictness  Analysis  and  Strictness  Annotations 

Every time an expression is delayed, there is a cost. Strictness analysis is an automatic technique
and strictness annotation is a programmer-controlled technique to decrease the cost of lazy
evaluation by computing some expressions eagerly.

Most  functional  languages  have  lazy  semantics,  and  the  compiler  is  expected  to  determine 
which  expressions  can  be  evaluated  eagerly.  Progress  has  been  made  in  the  area  of  strictness 
of  higher  order  functions  [25]  and  non-flat  domains  [20],  but  the  problem  is  far  from  solved. 

A function is said to be strict in an argument if the value of that argument is required for
the function to produce a value. Notationally, the function f is strict in its second argument if
for all x, f(x, ⊥) = ⊥, where ⊥ denotes an undefined or non-terminating computation.
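The definition can be exercised on small examples in Python by modeling ⊥ as a computation that raises a dedicated exception. This is a sketch under that assumption: real non-termination cannot be observed this way, and testing a single value of x does not establish the “for all x” in the definition. All names below are hypothetical.

```python
class Bottom(Exception):
    """Models an undefined or non-terminating computation."""


def bottom():
    raise Bottom()


def is_bottom(computation):
    """Return True if running the zero-argument computation hits bottom."""
    try:
        computation()
        return False
    except Bottom:
        return True


def add(x, y):
    # Arguments are delayed (zero-argument callables); add forces both,
    # so it is strict in both arguments.
    return x() + y()


def first(x, y):
    # Forces only x, so it is strict in its first argument but not its second.
    return x()
```

Here add(1, ⊥) is ⊥, so add behaves as strict in its second argument at this point, while first(1, ⊥) is 1, so first is not strict in its second argument.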

Hudak and Young show that the problem of first-order strictness analysis is complete in
exponential  time  and  attribute  the  result  to  Meyer  [25].  They  go  on  to  explain  that,  since  “the 
size  of  most  functions  is  small,  the  complexity  seems  to  be  tractable  in  practice.” 

It is worth noting that any programmatic strictness analysis technique must be an approximation.
Determining the strictness of a function asks a question about the termination behavior
of the function and is clearly undecidable, in general. In their section on the correctness of their
algorithm, Hudak and Young explain that their technique is safe as it never declares a function
strict that is not. So, even approximate techniques are complex. We give an example
demonstrating the difficulty of strictness analysis in Section 5.2.

To  assist  the  compiler,  annotations  are  sometimes  provided  to  the  programmer  to  declare 
strictness  properties  that  the  compiler  would  otherwise  have  to  deduce.  In  the  remainder  of 
this section, we will discuss the use of strictness annotations in Miranda³ [44] and FLIC [39].

Miranda  allows  the  programmer  to  annotate  algebraic  type  constructor  definitions  to  be 
strict  in  particular  arguments.  In  the  following  type  specification,  streams  of  numbers  are 
defined  to  have  strict  heads  and  lazy  tails.  The  head  is  strict  by  virtue  of  the  !  annotation, 
and the tail is lazy by default. scons stands for stream cons.

stream ::= empty |
           scons num! stream

Miranda

The strict/lazy argument pattern establishes a calling convention for each data-structure
constructor. The caller is compiled to pass certain arguments as values and the rest as delayed
expressions, and the constructor is compiled to receive that pattern. No provision is available,
however, to annotate general procedure definitions as having strict arguments. This is probably
because strictness analysis for procedures was better understood at the time that Miranda was
designed than was strictness analysis for data structures, a more recent development. However,
it is not possible to annotate built-in data type constructors such as cons, the list constructor.
Although the user can define new algebraic types with any strictness pattern desired using
the built-in types, the special syntax for supporting lists is lost, including dotdot notation
for constructing arithmetic sequences and recurrences, and list comprehension for generating,
mapping, and filtering lists. Dotdot notation and list comprehension, based on Zermelo-Fraenkel
set notation, are expressive ways to specify lists. Furthermore, Miranda provides no facility to
annotate actual parameter expressions as strict. We will return to this prospect presently.

³ Miranda is a trademark of Research Software Ltd.

Warren Burton proposes a variation of Miranda [15]. Procedures are called with strict
semantics, and data structures are constructed with lazy semantics. “Partially strict
pseudo-constructors” are procedures with annotations attached to their type specifications. The
annotations indicate varying degrees of laziness. The following code defines the constructor for a
stream of numbers with strict heads and lazy tails. The head is strict by default (procedures
have strict arguments by default), and the tail is lazy by virtue of the name annotation.

scons :: * name [*] -> [*]
scons a b = a:b

Warren Burton's Miranda


Since  the  annotations  are  attached  to  procedure  definitions,  which  are  strict  by  default,  the 
annotations  should  probably  be  called  laziness  annotations.  Three  modes  are  established  for 
passing  procedure  arguments.  Call  by  value  is  the  default  and  requires  no  explicit  annotation. 
Call  by  speculation  is  an  eager  variation  for  parallel  machines,  where  the  argument  and 
procedure  can  be  evaluated  in  parallel.  Call  by  name  is  a  lazy  evaluation  technique,  but  values 
are  not  memoized.  Burton  argues  that  it  may  be  cheaper  for  a  different  processor  to  recompute 
a  value  rather  than  to  send  the  computed  value  from  one  processor  to  another,  especially  in 
the case of a data structure. Besides, the programmer can provide for sharing explicitly.

FLIC (Functional Language Intermediate Code), as its name indicates, is not intended to
be a front-end language. As FLIC is intended to support any number of functional languages
targeted at any number of sequential or parallel architectures, complete facilities are available
for indicating strictness⁴. A procedure can be marked as strict in its argument (FLIC procedures
have only one argument), covering both general procedures and data structure constructors.
Closely related (and equivalent) is a sequentialization function that takes two expressions and
returns the second, but not until the first completes. The STRICT and SEQ primitives are defined
by the following equations:

STRICT f ⊥ = ⊥
STRICT f x = f x

SEQ ⊥ b = ⊥
SEQ a b = b

FLIC

⁴ FLIC defines an annotation to contain purely pragmatic information which can be deleted to derive the
semantics (meaning) of a program. We will take a looser interpretation of the term to include proper programming
constructs.

The following code⁵ defines scons as a stream constructor that has strict heads and lazy
tails. The STRICT function ensures that all first actual parameter expressions of scons are
reduced to values before scons is applied.

scons = STRICT (\x\y cons x y)

FLIC

We will consider sequences of reductions that demonstrate the use of the scons and STRICT
functions. head and tail are stream selectors.

head (x:xs) = x
tail (x:xs) = xs

FLIC

In  the  following  sequences,  expressions  that  are  underlined  are  about  to  be  rewritten. 

head (scons 5 ⊥)
⇒ head (STRICT (\x\y cons x y) 5 ⊥)
⇒ head ((\x\y cons x y) 5 ⊥)
⇒ head (cons 5 ⊥)
⇒ 5

FLIC

tail (scons ⊥ 5)
⇒ tail (STRICT (\x\y cons x y) ⊥ 5)
⇒ tail (⊥ 5)
⇒ tail ⊥
⇒ ⊥

FLIC
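The two reduction sequences can be imitated in Python. This sketch is not FLIC and makes several assumptions: delayed arguments are modeled as zero-argument callables, ⊥ is modeled as a callable that raises a hypothetical Bottom exception, and one-argument (curried) procedures are modeled with nested lambdas.

```python
class Bottom(Exception):
    """Models an undefined or non-terminating computation."""


def bottom():
    raise Bottom()


def STRICT(f):
    """STRICT f x: reduce x to a value first, then apply f to it."""
    def applied(x_thunk):
        value = x_thunk()              # raises Bottom if x is bottom
        return f(lambda: value)        # pass the already-computed value on
    return applied


def SEQ(a_thunk, b_thunk):
    """SEQ a b: return b, but not until a has been evaluated."""
    a_thunk()
    return b_thunk()


def cons(x_thunk, y_thunk):
    # A lazy pair: both components are stored unevaluated.
    return (x_thunk, y_thunk)


def head(cell):
    return cell[0]()


def tail(cell):
    return cell[1]()


# Mirrors: scons = STRICT (\x\y cons x y)
scons = STRICT(lambda x: lambda y: cons(x, y))
```

With these definitions, head(scons(lambda: 5)(bottom)) yields 5 because the tail is never forced, while tail(scons(bottom)(lambda: 5)) raises Bottom because the strict head is evaluated before the pair is built.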

⁵ Backslash is FLIC’s symbol for lambda.

As a more complex example consider the definition of strict_cons, the pair constructor
that is strict in both arguments:

strict_cons = STRICT (\x (STRICT (\y cons x y)))

FLIC


FLIC also provides facilities for evaluating an expression before applying a procedure. The
following let expression binds the value of the argument expression arg_exp to the name arg_val
and applies the procedure foo only after arg_val reduces to a value.

= arg_val arg_exp (SEQ arg_val (foo arg_val))

FLIC


FLIC  also  provides  annotations  (as  opposed  to  the  preceding,  which  were  proper  language 
constructs)  to  indicate  both  formal  and  actual  argument  strictness.  These  annotations  are  to 
be  used  by  the  compiler  after  performing  strictness  analysis,  for  example. 

FLIC  takes  a  general  approach.  Not  only  can  a  particular  procedure  be  established  with 
a mixed strict/lazy calling convention, but actual parameter expressions can be marked independently,
providing additional opportunities for savings. In conventional implementations,
the  various  mixed  calling  conventions  are  necessary  as  values  and  delayed  expressions  cannot 
be  freely  interchanged.  A  function  has  to  know  what  was  potentially  delayed,  and  what  was 
a  value.  Even  so,  an  actual  parameter  in  a  position  that  is  normally  passed  as  delayed  can 
be  marked  as  strict  to  some  advantage.  The  delayed  structure  (that  contains  a  flag  and  the 
delayed  expression  or  the  value  of  the  expression)  can  be  marked  “evaluated”  and  the  value  can 
be  computed  directly  and  stored.  A  delayed  structure  still  has  to  be  allocated  to  satisfy  the 
calling  convention,  but  the  expression  need  not  be  delayed,  and  can  even  be  computed  inline, 
providing  additional  savings.  In  this  way  we  have  a  hierarchy  of  mechanisms  corresponding  to 
increasing  efficiency. 

1.  lazy:  all  arguments  are  passed  unevaluated 

2. definitions are annotated: a procedure call is established for each strict/lazy combination,
and some arguments are passed evaluated, some unevaluated

3.  applications  are  annotated:  more  arguments  may  be  strict  but  still  packaged  as  evaluated 
delayed-expressions 
4. strict: all arguments are evaluated

Bloss, Hudak, and Young [14] develop an infrastructure which can take advantage of these
kinds of situations, which we discuss briefly in the next section.

In a lazy functional language, it is particularly effective to annotate definitions as most
expressions  can  be  eagerly  evaluated  and  one  definitional  annotation  covers  an  entire  class  of 
instances.  In  an  eager  language,  however,  we  do  not  wish  to  proliferate  laziness  casually,  as  it 
is  rarely  needed.  For  this  reason,  we  use  annotations  only  in  actual  expressions. 

If  values  and  unevaluated  expressions  can  be  freely  interchanged  (which  implies  implicit 
forcing),  as  they  can  be  in  Multi-Lisp  [21],  annotations  at  the  definition  and  at  the  application 
are  equivalent,  the  former  being  an  abbreviation  for  many  of  the  latter.  As  a  result  of  the 
interchangeability,  no  special  calling  conventions  are  necessary.  Such  a  design  decision,  however, 
requires  architectural  support  for  an  efficient  implementation,  lest  we  require  repeated  explicit 
checks. In an implementation on stock hardware with no architectural support, such as
Multi-Lisp on Concert [22] or Mul-T [30] (a dialect of Multi-Lisp and T compiled for the Encore
Multimax), this turns out to be quite expensive.

When  designing  hardware,  however,  providing  support  for  trapping  delayed  expressions  is 
usually  an  easy  extension.  Consider  Lisp  machines,  for  example,  which  trap  to  microcode  on 
all  sorts  of  exceptional  cases,  or  SOAR  [45]  (Smalltalk  on  a  RISC)  or  SPUR  [19]  which  trap  to 
software  in  exceptional  cases.  The  idea  is  to  handle  common  cases  quickly  while  trapping  and 
paying  a  penalty  in  the  less  frequent  exceptional  cases. 

We  restrict  our  attention  to  data  structures  and  develop  a  mechanism  that  is  transparent  to 
the  consumer  and  depends  on  hardware  support  for  an  efficient  implementation.  Annotations 
will always be associated with the actual construction of data structures.

Path  Analysis 

Hudak  et  al.  develop  a  technique  called  path  analysis  for  optimizing  implementations  of  lazy 
functional  languages  [14].  The  compiler  [48]  tries  to  determine  that  a  particular  use  of  a 
thunk is the first or the last, or that it cannot be the first (i.e., the delayed expression is
already  evaluated),  by  tracing  the  path  through  the  sequential  execution.  Each  special  case  has 
opportunities  for  optimization. 

Four ways of forming thunks for procedure arguments are described. The list includes the
standard Henderson-style self-modifying procedure, two forms of flag and procedure/value cells,
and, for completeness, a non-delayed mechanism. The four methods, or modes, are:

1.  Closure  Mode:  A  delayed  expression  is  wrapped  in  a  procedure.  The  procedure  may 
or  may  not  cache  the  value  when  it  gets  evaluated.  In  this  mode,  the  compiler  can 
rewrite  (DELAY  (FORCE  x))  as  x,  for  example.  Two  disadvantages  are  pointed  out.  Each 
use  of  the  thunk  requires  a  general  procedure  call,  which  is  expensive.  Context  specific 
optimizations  such  as  order  of  evaluation  with  respect  to  the  caller  are  not  available. 

2.  Cell  Mode:  A  thunk  is  represented  as  a  pair,  a  flag  and  a  closure  or  value,  depending  on 
the  flag.  The  closure  mode  problems  vanish,  but  some  optimizations  are  precluded  due 
to their interaction with other optimizations⁶.

3. Optimized Cell Mode: The delayed function’s arguments are guaranteed to be unevaluated
on entry. While (DELAY (FORCE x)) can no longer be rewritten as x, other optimizations
are enabled. The assertion about the function’s arguments greatly increases the
opportunities for using path analysis.

4. Value Mode: The value is computed directly. This mechanism is included for completeness.
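As a loose illustration of closure mode with caching, the Python sketch below wraps a delayed expression in a procedure that replaces its own body after the first call. It is only an approximation in the spirit of a self-modifying procedure, not the scheme described in [14]; the names delay and force and the state dictionary are hypothetical.

```python
def delay(compute):
    """Closure mode: wrap the delayed expression in a procedure."""
    state = {"run": None}

    def first_call():
        value = compute()
        # Self-modification: later calls return the cached value directly.
        state["run"] = lambda: value
        return value

    state["run"] = first_call
    return lambda: state["run"]()


def force(thunk):
    # Every use of the thunk is a general procedure call, evaluated or not.
    return thunk()
```

No flag is checked here, but each reference still pays for a procedure call, which is the first disadvantage noted for closure mode.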

Ordering  in  a  parallel  system,  however,  is  much  less  restrictive,  and  opportunities  for  such 
optimizations  are  significantly  diminished.  In  a  dataflow  system,  for  example,  the  program 
captures  only  a  partial  order  of  the  operations. 

1.2.4  Why  Not  Laziness 

Some  proponents  of  lazy  functional  languages  argue  that  lazy  functional  languages  support 
equational reasoning, but strict functional languages do not. Equational reasoning allows function
definitions to be treated as equations or identities in the sense that a compiler can substitute
them  without  changing  the  meaning  of  programs.  Consider  the  following  definition  of  the  pair 
selector  head.  We  would  like  to  be  able  to  view  it  as  an  equation  also,  relating  head  with  the 
pair  constructor  cons. 

head  (cons  x  y)  =  x 

⁶ An example demonstrating this point would take a great deal of development. The interested reader should
consult [14].

In a lazy functional language, this equation holds even if the evaluation of y diverges; since
y is not needed, it would never be evaluated. Whenever the compiler sees the expression head
(cons x y) in a program, it can substitute x without changing the meaning of the program.
In a strict functional language, however, both x and y are evaluated before cons or head are
applied. If the evaluation of y diverges, we cannot proceed, even though the value of y will be
discarded. If the compiler substitutes x for head (cons x y), it might change the termination
behavior of the program.

All non-strict functional languages, not just lazy languages, support equational reasoning [11].
In an eager non-strict functional language, the argument expressions can be evaluated
in parallel with each other and in parallel with procedure application. Furthermore, the notions
of “getting an answer” and “terminating” are separated, and, as a result, the equation makes
sense. The interested reader is referred to [33] for additional details.

Some  proponents  of  lazy  evaluation  claim  that  lazy  evaluators  are  more  efficient  than  eager 
evaluators  [11]  because  they  perform  the  minimum  number  of  reduction  steps  to  find  normal 
form.  But  more  interpretation  is  required  to  decide  which  reduction  is  next,  and  to  decide  if 
a  particular  expression  has  already  been  evaluated.  A  converse  claim  is  of  interest:  if  a  lazy 
interpreter  and  an  eager  interpreter  take  the  same  number  of  reduction  steps  to  reach  an  answer 
(normal form), then the lazy interpreter did at least as much total work as the eager interpreter.
Work  includes  both  reduction  and  interpretation. 

A  lazy  evaluator  is  necessarily  more  complex  and  therefore  more  expensive  than  an  eager 
evaluator,  as  pointed  out  in  [13].  This  expense  is  reduced  by  strictness  analysis  [25],  and  other 
similar  techniques  [14].  Most  programs  need  none  of  that  power,  however.  Even  programs 
requiring  lazy  evaluation  need  it  for  only  a  small  fraction  of  the  program.  This  assertion  will 
be  demonstrated  by  overwhelming  evidence. 

Control  Structures  Versus  Data  Structures 

Lazy evaluation might leave expressions unevaluated in two ways: both arguments to
procedures and data structure slots may be left unevaluated. The computation of an actual
argument to a procedure may diverge, yet the value of that argument may not be needed. This
is the situation if we treat conditionals as procedures and consider recursive definitions.
Consider the following Id definitions^. typeof declares the type of an identifier. No type
declarations are required by Id; they are included for documentation purposes.

^All code in this document is written in Id unless otherwise marked.

typeof  my_if  =  B  ->  *0  ->  *0  ->  *0; 
def  my_if  p  c  a  =  if  p  then  c  else  a; 

typeof  fact  =  N  ->  N; 
def fact n = if n<2 then 1
             else n * fact (n-1);

If we replace if by my_if in the preceding definition of fact, calls to fact will not terminate:
under eager evaluation, the recursive call passed as the else argument is evaluated on every
call, whether or not it is selected.

In the case of data structures, an “infinite object” can never be fully expanded, yet these
objects are sometimes convenient for programming. With the notable exception of the
conditional construct, we hypothesize that the lazy evaluation of control structures is rarely
needed, and the lazy evaluation of data structures is needed infrequently. The utility of our
approach, which disallows the former and facilitates the latter, can only be measured by its
practical effectiveness. Chapter 4 examines the strengths and weaknesses of this choice.
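
The behavior of fact under my_if can be reproduced in any language that evaluates procedure arguments before the call. The following Python sketch is only an illustration (Python is not the language of this thesis, and the names fact_eager, fact_thunked, my_if_thunked, and diverges are hypothetical). It shows that a procedural conditional forces the recursive call on every invocation, and that wrapping the branches in explicit thunks restores termination.

```python
def my_if(p, c, a):
    # Ordinary procedure: the caller has already evaluated both c and a.
    return c if p else a

def fact_eager(n):
    # The recursive call is an argument, so it is evaluated even when n < 2.
    return my_if(n < 2, 1, n * fact_eager(n - 1))

def my_if_thunked(p, c, a):
    # c and a are zero-argument functions; only the selected one is called.
    return c() if p else a()

def fact_thunked(n):
    return my_if_thunked(n < 2, lambda: 1, lambda: n * fact_thunked(n - 1))

def diverges(f, arg):
    # Python reports unbounded recursion as RecursionError rather than looping forever.
    try:
        f(arg)
    except RecursionError:
        return True
    return False
```

Here diverges(fact_eager, 3) reports the runaway recursion, while fact_thunked(5) terminates normally.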

In concentrating our effort on data structures, we are not alone. Miranda’s strictness
annotations apply only to data structures. In Warren Burton’s variation of Miranda,
procedures are called strictly and data structures are constructed lazily to varying degrees,
based on annotations. Conversely, Multilisp’s futures are oriented around expressions.

Explicit  Allocation  of  Storage 

While  the  producer  and  consumers  of  the  value  of  an  expression  have  no  storage  automatically 
and  naturally  associated  with  them,  the  producers  and  consumers  of  a  data  structure  meet  at 
the  data  structure  itself.  The  data  structure  provides  a  meeting  place,  and  the  data  structure 
operations  provide  a  point  in  time  for  orchestrating  the  delaying  and  forcing  of  expressions. 

Sequentialization  and  Parallelism 

The main source of parallelism in functional languages is the ability to evaluate all arguments
to all procedures in parallel. Strict functional languages require barrier synchronization to
ensure that all arguments are computed before a procedure is called. Similarly, lazy functional
languages apply procedures before evaluating arguments, a different form of sequentialization.
In both of these cases, a compiler can relax ordering restrictions if it can prove that the semantics
of the program are the same. Both ends of the spectrum, however, are inherently sequential,
and sequentialization comes at the price of parallelism.
1.2.5 Our Approach

On  one  extreme  is  dataflow,  an  eager  approach  amenable  to  parallel  execution.  On  the  other 
extreme  is  the  more  powerful  lazy  approach.  We  examine  an  intermediate  approach  on  the 
eager/lazy spectrum that offers most of the power of lazy evaluation, and the efficient parallel
implementation  that  comes  with  dataflow. 

If we only delay expressions explicitly associated with data structures, an interesting
compromise is achieved. The TTDA, the first architectural model for dynamic dataflow, already
synchronizes array producers and consumers in hardware using I-structure memory [9, 10, 23].
A  similar  synchronization  mechanism  is  required  to  support  delayed  expressions  that  sit  in 
data  structure  slots.  By  generalizing  I-structures,  we  can  support  both  demand  propagation 
and  producer/consumer  synchronization  in  hardware. 

We  develop  lazy  data-structures  for  the  dataflow  language  Id.  An  expression  destined 
for a lazy data-structure slot remains unevaluated until the slot is read, i.e., until the value
of  the  expression  is  requested.  Lazy  structures  in  an  otherwise  eager  system  thus  provide 
a  combination  of  eager  and  lazy  evaluation.  Lazy  data-structures  admit  programming  with 
infinite  data-structures  and  programming  with  data  structures  with  expensive  slots. 

The language we develop in this thesis, which we name Id#, is eager and non-strict. The
evaluation of certain expressions, which are always associated with data structures, can be
delayed.

The  language,  compiler,  and  run-time  system  extensions  discussed  in  this  thesis  have  been 
implemented*. Our graph interpreter [33] has been extended as well, providing an
opportunity for experimentation. These extensions are currently being installed on our first hardware
prototype  processor  [38]. 

1.3  Force  and  Delay 

In  this  section,  we  discuss  the  applicability  of  Henderson’s  Force  and  Delay  model,  a  language 
level  solution  that  is  the  basis  for  many  lazy  interpreters.  Force  and  delay  are  sufficient  to 
implement a lazy functional language over an eager interpreter [24], and Id can accommodate
these  with  higher-order  procedures.  However,  there  is  no  natural  way  to  provide  “memoization” 
within  a  functional  framework. 

*There  are  a  small  number  of  exceptions. 
Procedural abstraction cannot be used to define delay in an eager interpreter, as the
argument would be evaluated before it was passed to the delay “procedure”. “=>” represents
a macro-style source-level program transformation. A macro or special-form would be used in
Scheme to achieve the desired behavior. Consider the following Scheme definitions of delay
and force which, in Hudak’s terminology, store thunks in cell mode:


; delay and force in Scheme

(delay <exp>) => (cons 'delayed (lambda () <exp>))

(define (force delayed-exp)
  (if (eq? 'evaluated (car delayed-exp))            ; test flag
      (cdr delayed-exp)                              ; get memoized value
      (let* ((delay-function (cdr delayed-exp))      ; get thunk
             (evaled-exp (delay-function)))          ; evaluate thunk
        (set-cdr! delayed-exp evaled-exp)            ; memoize result
        (set-car! delayed-exp 'evaluated)            ; set flag
        evaled-exp)))

A  thunk  is  stored  in  a  cons  cell.  The  car  indicates  whether  or  not  an  expression  has  been 
evaluated.  The  expression  is  captured  in  an  unapplied  function  and  stored  in  the  cdr  part.  The 
first  time  a  delayed  expression  is  forced,  the  value  is  remembered,  and  the  flag  is  changed. 
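
For readers less familiar with Scheme, the same sequential memoizing scheme can be sketched in Python. This is an illustration under assumptions, not part of the system developed in this thesis: the class Delayed and its field names are hypothetical, and because Python has no macros the caller must wrap the expression in a lambda by hand.

```python
class Delayed:
    """A two-field cell: a flag, and either a thunk or its memoized value."""
    def __init__(self, thunk):
        self.flag = "delayed"
        self.payload = thunk

def delay(thunk):
    # The caller supplies a zero-argument function standing for <exp>.
    return Delayed(thunk)

def force(cell):
    if cell.flag == "evaluated":      # test flag
        return cell.payload           # get memoized value
    value = cell.payload()            # evaluate thunk
    cell.payload = value              # memoize result
    cell.flag = "evaluated"           # set flag
    return value
```

Forcing the same cell twice runs the thunk only once; the second call returns the stored value. Nothing in this version is safe against two concurrent callers of force.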

There are several problems in implementing such a scheme on a parallel machine, as
synchronization is required: the force procedure must manipulate the delayed object atomically,
and both the flag and the data are mutated. There are no facilities to support this behavior in
Id; the model must be extended to allow the desired behavior.

We  can  introduce  a  semaphore  and  the  ability  to  change  the  flag  and  data.  The  semaphore 
would  be  acquired  before  manipulation  of  the  delayed  expression  began.  Unfortunately,  this 
requires  extra  time  to  manipulate  the  semaphore  as  well  as  space  to  maintain  it.  The  time 
overhead  could  be  eliminated  by  using  the  flag  for  synchronization  as  well  as  indicating  the 
evaluation  status  of  the  expression.  This  is  possible  if  we  allocate  space  for  both  the  delayed 
and the evaluated expressions, as expressed in Id below. Exchange is an atomic operation that
places a value in a structure slot and returns the previously stored value. evaluation_flag is
a new enumerated type. Consider the following extended Id definitions of delay and force.
The bodies of both delay and force are let expressions, consisting of bindings and a result
expression (after the in). All variables bound in a let expression are lexically scoped, as are all
variables in Id.
% force and delay using the flag for synchronization
% We use => since there are no macros or special-forms in Id

type evaluation_flag = delayed | evaluated;

delay <exp> =>
  {delayed_exp = I_array (0,2);
   delayed_exp[0] = delayed;                  % flag stored
   def delay trigger = <exp>;                 % thunk created
   delayed_exp[1] = delay                     % thunk stored
   in
     delayed_exp};

def force delayed_exp =
  {flag = exchange delayed_exp 0 evaluated    % set and test flag
   in
     if flag == evaluated then
       delayed_exp[2]                         % get memoized value
     else
       {delay_function = delayed_exp[1];      % get thunk
        evaled_exp = delay_function 0;        % evaluate thunk
        delayed_exp[2] = evaled_exp           % memoize result
        in
          evaled_exp}};

This  solution  is  very  close  to  the  standard  one,  and,  as  such,  has  the  standard  inefficiencies. 
We  have  not  taken  advantage  of  our  ability  to  influence  the  architecture.  A  similar  synchro¬ 
nization  problem  has  already  been  solved  by  I-structure  memory  in  synchronizing  producers 
and  consumers  of  data  structures.  Since  a  small  number  of  states  are  required  to  capture  the 
above  behavior,  we  can  solve  our  new  problem  efficiently  by  augmenting  the  hardware. 

We do not adopt a source-to-source Henderson-style system. The framework we develop,
however,  is  powerful  enough  to  embed  such  a  system,  and  we  will  discuss  this  embedding  in 
Section  5.3.2. 


1.4  Background:  Id  and  Dataflow 

We  assume  the  reader  is  familiar  with  functional  languages  and  graph  reduction.  In  the  following 
sections,  we  give  a  brief  introduction  to  the  language  Id  and  dataflow  computing. 
1.4.1 The Id Language

Id  was  developed  at  the  University  of  California  at  Irvine  [5]  and  has  evolved  through  several 
revisions  at  MIT’s  Laboratory  for  Computer  Science.  The  research  in  this  thesis  was  coincident 
with  the  development  of  the  latest  version  [34]  and,  as  a  result,  some  of  the  “new  ideas” 
presented  herein  are  already  in  the  current  language  document. 

Id  is  an  eager  non-strict  declarative  language  that  supports  higher-order  procedures.  Id’s 
single  assignment  syntax  will  be  introduced  as  we  proceed. 

Id  programs  are  compiled  into  dataflow  graphs  which  capture  the  data  dependences  of  the 
program  [43];  nodes  correspond  to  operations,  and  arcs  correspond  to  data  dependences.  The 
dataflow  graphs  can  be  executed  directly  on  dataflow  machines. 

1.4.2  Dataflow  Machines 

Dataflow machines provide a vehicle for the execution of dataflow programs. Parallel
architectures designed to provide cheap synchronization and tolerate long memory latencies, TTDA [7]
and Monsoon, an Explicit Token Store machine [38], are tagged token dataflow machines as
they support general purpose computation.

Values travel along arcs of the dataflow graph as tokens, and the machine enables operation
nodes  for  execution  by  detecting  the  arrival  of  a  matching  pair  of  tokens. 

TTDA and Monsoon support data structures with I-structure memory [9] which provides, in
addition  to  the  usual  memory  operations,  operations  that  are  useful  for  parallel  computing.  For 
example,  producers  and  consumers  can  be  synchronized  on  a  per  element  basis:  if  a  consumer 
arrives  before  a  producer,  the  consumer  is  delayed  automatically  until  a  value  arrives. 
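
The per-element synchronization described above can be imitated in software. The following Python sketch is an assumption-laden illustration, not a description of the TTDA or Monsoon hardware: the class IStructureSlot and the function run_demo are hypothetical names, and a thread that blocks on an event stands in for a deferred read.

```python
import threading

class IStructureSlot:
    """Write-once slot; a fetch blocks until the slot has been written."""
    def __init__(self):
        self._full = threading.Event()
        self._lock = threading.Lock()
        self._value = None

    def store(self, value):
        with self._lock:
            if self._full.is_set():
                raise RuntimeError("slot written twice")
            self._value = value
            self._full.set()

    def fetch(self):
        self._full.wait()      # the consumer is deferred until a value arrives
        return self._value

def run_demo():
    slot = IStructureSlot()
    results = []
    consumer = threading.Thread(target=lambda: results.append(slot.fetch()))
    consumer.start()           # the consumer may reach fetch before the store below
    slot.store(7)              # the producer writes; any deferred fetch is released
    consumer.join()
    return results
```

Whichever side arrives first, the consumer observes the produced value, and a second store to the same slot is rejected.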

1.5  Other  Dataflow  Approaches 

We  discuss  two  other  approaches,  Pingali’s  and  Amamiya’s,  for  achieving  laziness  in  a  dataflow 
environment. 

1.5.1  Pingali’s  Demand-Driven  Interpreter 

Pingali  proposes  a  source-to-source  program  transformation  for  achieving  lazy  behavior  within 
an  eager  interpreter  [40].  Although  we  take  a  different  direction,  Pingali’s  work  provided  both 
the  semantic  base  and  the  motivation  for  the  work  in  this  thesis. 
Pingali’s approach offers the power of a lazy interpreter within the framework of dataflow,
but  it  is  difficult  to  implement  in  practice.  We  briefly  describe  Pingali’s  approach  and  the 
problems  that  arise  from  it. 

Each  program  graph  is  overlaid  by  a  complementary  demand  graph  that  propagates  tokens 
representing  demands  explicitly.  A  fork  in  a  dataflow  graph  duplicates  a  token  so  that  it  may  be 
sent  to  more  than  one  consumer.  In  Figure  1.2,  which  depicts  a  transformed  fork  (a  fork  along 
with  its  demand  graph),  the  program  graph  “points  down”  and  the  demand  graph  “points  up”. 
The  bow-tie  shaped  nodes  pass  along  the  data  tokens  (the  ones  going  down)  when  both  the 
data  and  demand  tokens  arrive.  The  d-union  node  is  a  consuming  merge:  it  forwards  the  first 
token  it  receives  and  discards  the  rest. 

Suppose  the  result  of  a  subexpression  is  shared,  but  not  all  consumers  demand  the  value. 
The  fork  that  distributes  the  value  is  left  with  residual  tokens  —  one  for  each  inactive  fork  arm. 

[Figure 1.2: Dynamic Behavior of a Demand-Driven Fork. The fork computes y1 = y2 = x, with
input x, outputs y1 and y2, input-demand x-d, and output-demands y1-d and y2-d. Panels:
phase 1, demand arrives; phase 2, demand propagated; phase 3, value arrives; phase 4, value
propagated.]

Figure  1.2  illustrates  the  fork  problem.  In  Phase  1,  a  demand  arrives  for  y2  on  y2-d.  The 
demand  token  is  duplicated;  one  copy  waits  at  the  gate  on  the  right,  and  the  other  propagates 
through  the  d-union  to  x-d  (Phase  2).  Eventually  x  arrives  (Phase  3).  The  value  token  is 
duplicated;  one  copy  meets  its  partner  at  the  right  hand  gate  and  passes  on  as  y2,  and  the 
other waits at the left hand gate, in case y1-d ever arrives (Phase 4). If y1 is never demanded,
a  token  remains  in  the  graph.  A  residual  token  in  a  dataflow  graph  acts  like  a  pointer  to  the 
enclosing  procedure  invocation  frame  and  prevents  it  from  being  reclaimed. 

The  repercussions  of  the  unclaimed  resources  are  more  severe  than  they  may  seem  at  first. 
The  delaying  environment  must  be  maintained  until  the  delayed  expression  is  evaluated.  Even 
if  it  is  evaluated  at  some  point,  the  lifetime  of  the  delayed  expression  may  be  independent  of 
the lifetime of the environment in which it was generated, in which case, the delayed expression
will  drag  the  delaying  environment. 

This  lifetime  coupling  problem  cannot  be  ignored:  if  values  are  always  demanded  by  all 
possible  consumers,  lazy  evaluation  saves  us  nothing.  Similarly,  infinite  streams  always  have 
unevaluated  tails.  By  taking  a  restricted  view  of  lazy  evaluation,  we  can  deal  with  these  issues. 

1.5.2 Amamiya’s Approach

Amamiya  implements  lazy  data-structures  by  putting  gates  at  the  inputs  of  the  subgraphs  to 
be  delayed  [2].  When  the  slots  are  demanded,  a  token  is  sent  back  into  the  graph,  allowing  the 
computation  to  proceed.  Amamiya’s  graphs  also  suspend  indefinitely  when  a  delayed  slot  is 
never  demanded. 

Amamiya  also  describes  a  mechanism  [3]  very  similar  to  the  lazy  data-structures  presented 
in  this  thesis.  However,  he  seems  to  imply  a  stronger  result.  In  our  system,  data  structure  slots 
are  evaluated  when  a  value  is  requested,  but  not  necessarily  required.  In  a  lazy  system,  which 
Amamiya  is  claiming  his  to  be  (it  appears,  although  his  terminology  is  confusing),  the  contents 
of  a  data  structure  slot  can  be  read  and  then  thrown  away  without  causing  the  evaluation  of 
the  delayed  expression.  We  stress  the  importance  of  the  distinction. 

Amamiya  also  indicates  that  cells  used  to  provide  demand  synchronization  can  be  allocated 
statically  and  “. . .  be  free  of  the  runtime  memory  allocation  ...  ”,  implying  static  deallocation. 
But this is not statically determinable in general. If we can tell when a delayed expression is
evaluated, we can deduce that it is evaluated, and it need not be delayed.

1.6  Overview  of  Thesis 

In  the  remaining  chapters  of  the  thesis,  we  develop  lazy  data-structures,  a  limited  form  of 
lazy  evaluation  that  extends  the  programming  language  Id,  along  with  an  implementation  that 
naturally  and  efficiently  employs  architectural  support.  The  language  will  not  be  as  expressive 
as  a  lazy  functional  language,  and  we  will  consider  the  limitations. 

In  Chapter  2,  we  present  Id#,  Id  plus  lazy  data-structures.  We  discuss  the  implementation 
in  Chapter  3.  Chapter  4  presents  the  programming  methodology  for  using  lazy  data-structures, 
and  discusses  the  expressive  power  and  shortfalls.  Chapter  5  concludes  with  discussions  of  the 
presented  system  and  future  work. 
Chapter 2

A  Language  with  Lazy 
Data-Structures 


Id# extends Id [37] by introducing lazy data-structures¹. Delayed expressions are always
associated with data structure slots.

Most  non-strict  functional  languages  use  a  lazy  evaluation  rule  as  the  default,  and  some 
allow  the  user  to  specify  eager  evaluation  for  certain  function  arguments.  We  take  a  converse 
approach,  assuming  eager  evaluation,  and  allowing  the  producer  of  a  data  structure  to  indicate 
by  explicit  annotation  that  certain  slots  are  to  be  assigned  lazily.  No  annotation  is  required 
when  a  data  structure  is  consumed;  it  is  transparent  to  the  consumer  of  a  data  structure  whether 
the  slot  was  assigned  eagerly  or  lazily. 

In this chapter we present our approach to lazy evaluation, the syntax and semantics of
our  language,  and  the  language-related  (abstract)  costs  and  benefits. 

2.1  Approach 

In  Id#,  the  producer  of  a  data  structure  can  delay  the  evaluation  of  the  expression  that  defines 
the contents of particular fields of records and the contents of individual slots of arrays. The
consumer of a lazy data-structure does not know whether a slot has been delayed, and has no
way of telling. The contents are automatically forced if necessary, and the computed value is
stored for later use.

It is worth discussing what is meant by if necessary. A delayed expression resident in a data
structure slot is evaluated whenever the contents are requested through a fetch, regardless of
whether or not the value is actually thrown away or used in the computation. This point will
be expanded presently by example.

¹Lazy data-structures have already been incorporated into “current Id”. We use “Id” to refer to the language
with no lazy data-structures, and “Id#” to refer to the language cum lazy data-structures.

Assignments  to  individual  slots  are  annotated  by  the  programmer  for  lazy  evaluation,  and 
all assignments to unmarked slots are evaluated eagerly.

As we have noted, Id is not a functional language. A large subset of Id is functional, though.
The only non-functional construct in Id is the I-structure². Note, however, that I-structures
preserve the determinacy of Id. Although we are mostly interested in the functional subset of
Id, we also include I-structures for completeness. The consequences of the non-functional
aspect will be discussed.

2.2  Syntax 

There are four types of data structures in Id#: tuples, algebraic types³ (including lists), arrays
(these first three types are functional), and I-structures (non-functional). We consider syntax
for  creating  each  of  these  with  lazy  components.  After  discussing  algebraic  types,  but  before 
discussing  arrays,  we  present  two  syntactic  sugars  for  producing  lists:  arithmetic  sequences  and 
list  comprehension. 

2.2.1  Tuples 

An  expression  in  a  tupling  construct  can  be  preceded  by  a  “#”  to  indicate  that  it  is  to  be 
delayed.  In  the  following  binding,  the  second  and  third  tuple  slots  are  delayed: 

a_tuple = exp1, # exp2, # exp3, exp4;

Tuples  are  accessed  by  pattern  matching,  and  lazy  slots  are  evaluated  when  tuples  are 
destructured. Consider the following definition that contains a let block that returns a
three-tuple. Underscore (“_”) is used in a destructuring pattern as a place holder when the value
associated with a position is to be ignored.

consume_tuple =
  { e1,e2,_,e4 = a_tuple
    in
      e1,e2,e4};

²The term “I-structure” refers both to a language construct and to an implementation level construct. In this
chapter, “I-structure” refers to the language construct, unless otherwise noted.

³Technically, tuples may be considered algebraic types. We separate them for expository purposes.
e1, e2, and e4 all need values, and, as a result, exp2 is evaluated, while exp3 remains
delayed. Even if e2 did not appear in the return expression or if e2 were not used by the caller
of consume_tuple, exp2 would be evaluated.
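
The rule that a fetch forces a slot whether or not the fetched value is later used can be illustrated with a small Python model. This is a sketch under assumptions rather than the Id# implementation: the class Slot and the helpers make_tuple and consume_tuple are hypothetical, eager slots are simply evaluated at construction time, and the log records the order in which the stand-in expressions run.

```python
class Slot:
    """A tuple slot that is either evaluated at construction or on first fetch."""
    def __init__(self, thunk, lazy):
        self._thunk = thunk
        self._done = False
        self._value = None
        if not lazy:
            self.fetch()               # eager slot: evaluate immediately

    def fetch(self):
        if not self._done:
            self._value = self._thunk()
            self._done = True
            self._thunk = None         # the thunk is no longer needed
        return self._value

def make_tuple(log):
    def exp(name):
        def thunk():
            log.append(name)           # record that this expression was evaluated
            return name
        return thunk
    # models: a_tuple = exp1, # exp2, # exp3, exp4
    return (Slot(exp("exp1"), lazy=False), Slot(exp("exp2"), lazy=True),
            Slot(exp("exp3"), lazy=True), Slot(exp("exp4"), lazy=False))

def consume_tuple(t):
    # Pattern e1,e2,_,e4: positions 0, 1 and 3 are fetched; position 2 is skipped.
    e1, e2, e4 = t[0].fetch(), t[1].fetch(), t[3].fetch()
    return e1, e4                      # e2 is discarded, yet exp2 has been forced
```

After consume_tuple runs, the log contains exp2 even though its value was thrown away, and exp3 never appears.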

2.2.2  Algebraic  Types 

The other heterogeneous, unsubscripted types are collectively called algebraic types. Cons (the
list constructor) is an example of a built-in algebraic type. Conses are constructed with an infix
“:”. By using the “#” annotation, either the head or tail (or both) can be delayed. Consider
the following code:

some_conses =
  {c1 = exp1 : exp2;
   c2 = exp3 : #exp4;
   c3 = #exp5 : exp6;
   c4 = #exp7 : #exp8
   in
     c1,c2,c3,c4};

c1 is a normal cons: both the head and tail are evaluated eagerly. c2 has an eager head
and  a  lazy  tail,  and  c3  has  a  lazy  head  and  an  eager  tail.  c4  is  a  lazy  cons:  both  the  head  and 
tail  are  evaluated  lazily. 

Eager  and  lazy  slots  of  cons  cells  are  accessed  uniformly  by  pattern  matching,  oblivious 
to the method of assignment. When a lazily assigned slot is accessed, the computation is
performed  and  the  value  is  returned.  Consider  the  following  binding: 

consume_conses =
  { c1,c2,c3,c4 = some_conses;
    e3:e4 = c2;
    _:e6 = c3;
    e7:_ = c4
    in
      e3,e4,e6};

e3, e4, e6, and e7 all need values, and, as a result, exp4 and exp7 are evaluated (even
though e7 is not returned), but exp5 and exp8 remain delayed.

Similarly, user-defined algebraic types can have lazy slots. Expressions in algebraic type
constructors can be preceded by a “#”. In the following code, the foo type is introduced with
two constructors, Bar and Baz. The call to make_a_complex_foo, which is destined for the first
slot of a Baz type foo, is delayed. “#” binds less tightly than procedure application.

type  foo  =  Bar  |  Baz  foo  N; 

Baz  (#  make_a_complex_foo  a  b  c)  17 

The  first  component  of  the  above  expression  is  delayed.  This  technique  is  especially  useful 
for  defining  recursive  data  structures.  And,  once  again,  algebraic  types  are  accessed  using 
pattern  matching,  with  delayed  slots  being  forced  implicitly  on  selection. 

Since  cons  is  such  an  important  constructor,  we  introduce  special  syntax  for  delaying  its 
arguments.  The  “#”  that  would  normally  precede  the  first  argument  to  cons  may  succeed  it  as 
follows: 

same_conses =
  { c1 = exp1 : exp2;
    c2 = exp3 :# exp4;
    c3 = exp5 #: exp6;
    c4 = exp7 #:# exp8
    in
      c1,c2,c3,c4};

This variation allows us to think of four infix cons operators: normal cons (“:”), stream
cons (“:#”, i.e., tail-lazy cons), head-lazy cons (“#:”), and head-lazy stream cons (“#:#”, i.e.,
fully-lazy cons).

2.2.3  Arithmetic  Sequences  and  List  Comprehension 

Id#  has  special  syntax  for  generating  lists  called  arithmetic  sequences  and  list  comprehension. 
Arithmetic  sequences  (like  Miranda’s  dotdot  notation  [44])  are  convenient  ways  of  expressing 
a  range  of  integers.  List  comprehension  (based  on  Miranda’s  list  comprehension  [44])  provides 
a  compact  way  of  specifying  more  complex  lists.  Arithmetic  sequences  and  list  comprehension 
are based on Zermelo-Fraenkel set notation.

Arithmetic  Sequences 

Arithmetic  sequences  are  language  idioms  which  denote  ascending  or  descending  lists  of  numbers 
with  a  constant  first-difference.  Consider  the  arithmetic  sequence  for  generating  the  list  of 
integers from one to ten, and the arithmetic sequence for generating the list of integers from
ten to one. Each binding is followed by a pseudo-Id expression that gives its meaning (after
the “%=”). “%” is Id’s comment character. “:” associates to the right.

typeof one2ten = list N;
one2ten = 1 to 10;
%= 1 : 2 : 3 : ... : 10 : nil

typeof ten2one = list N;
ten2one = 10 downto 1;
%= 10 : 9 : 8 : ... : 1 : nil

The  “<exp>  to  <exp>  by  <exp>”  idiom  allows  arithmetic  sequences  with  a  computed  first 
difference  to  be  specified  conveniently. 

typeof odds_between_1_and_20 = list N;
odds_between_1_and_20 = 1 to 20 by 2;
%= 1 : 3 : 5 : ... : 19 : nil

These behaviors can be captured in Id# “library routines” as follows. An “_” is prefixed to
all functions to preclude confusion with any keywords.

typeof _to = N -> N -> (list N);
def _to lo hi =
  if lo > hi then
    nil
  else
    lo : _to (lo+1) hi;

typeof _to_by = N -> N -> N -> (list N);
def _to_by lo hi step =
  if lo > hi then
    nil
  else
    lo : _to_by (lo+step) hi step;

These facilities existed in Id. Now we extend them to allow us to conveniently express infinite
arithmetic sequences. upfrom <exp> and downfrom <exp> are new idioms that denote infinite
arithmetic sequences of integers, the former ascending, and the latter descending. Consider the
following binding for the integers. “:#” associates to the right.
typeof ints = list N;
ints = upfrom 1;
%= 1 :# 2 :# 3 :# ...

Both  upfrom  and  downfrom  can  be  used  in  conjunction  with  the  by  keyword  to  vary  the 
step  size  of  the  sequence.  Consider  the  stream  of  odd  integers: 

typeof  odds  =  list  N; 
odds  =  upfrom  1  by  2; 

%= 1 :# 3 :# 5 :# ...

These behaviors can be expressed in Id#. Each succeeding recursion is buried in the
delayed tail of a stream cell. Consider the following definitions for _upfrom and for _upfrom_by.
Definitions for _downfrom and for _downfrom_by are similar.

typeof _upfrom = N -> (list N);
def _upfrom n = n :# _upfrom (n+1);

typeof _upfrom_by = N -> N -> (list N);
def _upfrom_by n step = n :# _upfrom_by (n+step) step;

Even  though  the  “by  expression”  can  be  variable,  it  is  constant  with  respect  to  the  sequence; 
it  is  evaluated  once,  and  the  value  is  used  repeatedly. 
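
The way each recursion is buried in a delayed tail can be mimicked in Python. The sketch below is illustrative only: the class Stream and the helper take are hypothetical names, the tail is modeled as a memoized zero-argument function, and None is never reached because the sequences are infinite.

```python
class Stream:
    """A cons cell with an eager head and a delayed, memoized tail."""
    def __init__(self, head, tail_thunk):
        self.head = head
        self._tail_thunk = tail_thunk
        self._tail = None
        self._forced = False

    @property
    def tail(self):
        if not self._forced:
            self._tail = self._tail_thunk()   # run the buried recursion once
            self._forced = True
            self._tail_thunk = None
        return self._tail

def upfrom_by(n, step):
    # step is already a value here; every cell reuses it without re-evaluation.
    return Stream(n, lambda: upfrom_by(n + step, step))

def upfrom(n):
    return upfrom_by(n, 1)

def take(k, s):
    """Return the first k heads, forcing only the k-1 tails needed to reach them."""
    out = []
    for i in range(k):
        out.append(s.head)
        if i < k - 1:
            s = s.tail
    return out
```

For example, take(5, upfrom(1)) walks five cells of the stream of integers, and take(3, upfrom_by(1, 2)) walks three cells of the stream of odd integers; cells beyond those are never built.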

Infinite arithmetic sequences point to a new opportunity, the possibility of performing
speculative computation. Each time we reach “the current end” of an infinite arithmetic sequence,
we  can  extend  it  by  more  than  one  element.  The  unwind  keyword  gives  the  programmer  control 
over  the  amount  that  the  stream  expands  each  time  its  tail  is  forced.  Consider  the  following 
binding  for  the  integers  that  expands  three  slots  at  a  time. 

typeof ints_unwind3 = list N;
ints_unwind3 = upfrom 1 unwind 3;
%= 1 : 2 : 3 :# 4 : 5 : 6 :# ...

Abstractly,  if  we  are  unwinding  an  infinite  arithmetic  sequence  by  three,  each  time  the 
delayed  tail  of  the  sequence  is  reached,  three  elements  are  produced.  By  default,  arithmetic 
sequences  unwind  by  one. 

typeof _upfrom_unwind = N -> N -> (list N);
def _upfrom_unwind lo unwind =
  {def _upfrom_unwind_ lo count =
     if count == 1 then
       lo :# _upfrom_unwind_ (lo+1) unwind
     else
       lo : _upfrom_unwind_ (lo+1) (count-1);
   in
     _upfrom_unwind_ lo unwind};

It  is  possible  to  apply  unwinding  to  finite  arithmetic  sequences  as  well,  but  there  is  a 
small  complication,  as  the  unwinding  may  not  be  finished  when  the  list  ends.  Suppose  we  are 
unwinding  a  finite  arithmetic  sequence  by  three  elements  at  a  time.  Each  time  the  delayed  tail 
of  the  stream  is  reached,  up  to  three  elements  are  produced.  Fewer  than  three  elements  are 
produced  if  the  list  ends,  as  we  see  in  the  following  example: 

typeof one2four_unwind3 = list N;
one2four_unwind3 = 1 to 4 unwind 3;
%= 1 : 2 : 3 :# 4 : nil

Unwinding can be used in combination with the by keyword. Id definitions for _to_unwind,
_downto_unwind, _to_by_unwind, etc., are omitted.
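
Since the Id definitions are omitted, the following Python sketch gives one possible reading of a finite unwinding sequence. Everything here is an assumption made for illustration: the names Cell, to_unwind, and materialized_prefix are hypothetical, None plays the role of nil, and the recursion mirrors the counting scheme of the infinite case with an added end-of-list test.

```python
class Cell:
    """A cons cell whose tail is either already built or delayed behind a thunk."""
    def __init__(self, head, tail=None, delayed_tail=None):
        self.head = head
        self._tail = tail
        self._delayed_tail = delayed_tail     # zero-argument function or None

    def tail_is_delayed(self):
        return self._delayed_tail is not None

    @property
    def tail(self):
        if self._delayed_tail is not None:
            self._tail = self._delayed_tail() # force and memoize
            self._delayed_tail = None
        return self._tail

def to_unwind(lo, hi, unwind):
    """Finite sequence lo..hi, built at most `unwind` cells per force."""
    def build(lo, count):
        if lo > hi:
            return None                       # the list ended before the group was full
        if count == 1:
            return Cell(lo, delayed_tail=lambda: build(lo + 1, unwind))
        return Cell(lo, tail=build(lo + 1, count - 1))
    return build(lo, unwind)

def materialized_prefix(cell):
    """Heads reachable without forcing any delayed tail."""
    out = []
    while cell is not None:
        out.append(cell.head)
        if cell.tail_is_delayed():
            break
        cell = cell.tail
    return out
```

For the sequence 1 to 4 unwinding by three, the first group holds 1, 2, and 3; forcing the delayed tail of the third cell produces the single remaining element followed by the end of the list.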

All of the previous “library routines” were defined in terms of recursive procedures.
Whenever unwinding is present, a routine can be defined more efficiently using Id#’s looping construct
in conjunction with a data structure called an “open list”. Open lists (similar to Prolog’s
“difference lists” [17]) use I-structures to define a list by successively appending elements to the
“open slot” at the end⁴. Arithmetic sequences that unwind one element at a time such as
upfrom 1 do not benefit from this opportunity.

List  Comprehension 

List comprehension is based on Zermelo-Fraenkel set notation and allows the programmer to
conveniently  express  lists  that  are  generated  by  mapping  and  filtering  over  other  lists. 

In a list comprehension, an expression is evaluated in a sequence of binding environments,
and the results are collected in order. A list comprehension begins with “{:”. The following
list comprehension denotes the list of integers from one to ten.

typeof one2ten = list N;
one2ten = {: i || i <- 1 to 10};
%= 1 : 2 : 3 : ... : 10 : nil

i <- 1 to 10 generates a sequence of binding environments in which i is bound to 1, 2,
..., 10 and is called a generator.

⁴Examples of the use of open lists as well as additional discussion can be found in [6].

More  generally,  any  expression  can  be  evaluated  in  the  sequence  of  generated  binding  envi¬ 
ronments.  binds  less  tightly  than  procedure  application. 

{: f i || i <- is}

%= (f i1) : (f i2) : (f i3) : ... : (f in) : nil

%= f i1 : f i2 : f i3 : ... : f in : nil

where  ij  is  the  jth  element  of  the  list  is. 

Filtering  can  be  accomplished  by  associating  the  when  or  unless  keywords  with  a  generator. 
The  following  list  comprehension,  which  assumes  the  existence  of  a  predicate  to  test  integers 
for  primality,  denotes  the  list  of  primes  below  twenty: 

typeof prime? = N -> B;

typeof primes_below_20 = list N;

primes_below_20 = {: i || i <- 1 to 20 when prime? i};

%= 2 : 3 : 5 : ... : 19 : nil

Several  generators  may  be  present,  in  which  case  the  sequence  of  binding  environments  is 
given  by  a  row  major  order  traversal  of  the  cross  product  of  the  generated  environments.  Each 
inner  environment  can  use  names  defined  in  an  outer  scope.  The  following  list  comprehension 
denotes  the  lower  right  triangle  of  a  three-by-three  grid: 

typeof grid = list (N,N);

grid = {: (i,j) || i <- 1 to 3 & j <- 1 to i};

%= (1,1):(2,1):(2,2):(3,1):(3,2):(3,3):nil
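
For readers more familiar with other notations, the filtering and multiple-generator examples
above correspond closely to Python list comprehensions. This is only an analogy; the helper
is_prime is a hypothetical stand-in for the assumed predicate prime?:

```python
def is_prime(n):
    """Hypothetical primality test standing in for the assumed `prime?`."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


# Filtering: a `when` clause corresponds to a trailing `if`.
primes_below_20 = [i for i in range(1, 21) if is_prime(i)]

# Several generators: the leftmost generator varies slowest (row major
# order), and the inner generator may use names bound by the outer one.
grid = [(i, j) for i in range(1, 4) for j in range(1, i + 1)]

print(primes_below_20)  # [2, 3, 5, 7, 11, 13, 17, 19]
print(grid)             # [(1, 1), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3)]
```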

Just as the generation of lists is supported with list comprehension syntax, the generation
of streams is supported with stream comprehension syntax. A stream comprehension begins
with “{:#”, as a stream is a tail-lazy list, and “:#” is tail-lazy cons (stream cons). Consider
the following stream comprehension for the stream of integers from one to ten:

typeof one2ten = list N;

one2ten = {:# i || i <- 1 to 10};

%= 1 :# 2 :# 3 :# ... :# 10 :# nil

As in list comprehension, any expression can be evaluated in the sequence of generated
binding environments. “:#” binds less tightly than procedure application.

{:# f i || i <- is}

%= f i1 :# f i2 :# f i3 :# ... :# f in :# nil


where  ij  is  the  jth  element  of  the  list  is. 

As  in  list  comprehension,  filtering  can  be  accomplished  by  associating  the  when  or  unless 
keywords  with  a  generator.  The  following  stream  comprehension  denotes  the  stream  of  primes 
below  twenty: 

typeof  prime?  =  N  ->  B; 
typeof primes_below_20 = list N;

primes_below_20 = {:# i || i <- 1 to 20 when prime? i};
%= 2 :# 3 :# 5 :# ... :# 19 :# nil

Stream comprehension offers several opportunities that are not available to list comprehension.
The first new opportunity is the application of program-controlled unwinding. Consider
the following definition, which maps a function over a stream and unwinds two elements at a
time:


typeof smap1_unwind2 = (N->N) -> (list N) -> (list N);

def smap1_unwind2 f is = {:# f i unwind 2 || i <- is};

%= f i1 : f i2 :# f i3 : f i4 :# ... :# f in : nil

By  default,  a  stream  comprehension  unwinds  by  one.  Suppose  we  are  unwinding  a  stream 
comprehension  by  three  elements  at  a  time.  Each  time  the  delayed  tail  of  the  stream  is  reached, 
up  to  three  elements  are  produced.  Fewer  than  three  elements  are  produced  if  the  sequence  of 
binding  environments  ends,  as  we  see  in  the  following  example: 

typeof  one2four_unwind3  =  list  N; 
one2four_unwind3  =  {:#  i  unwind  3  ||  i  <-  1  to  4}; 

%= 1 : 2 : 3 :# 4 : nil

Variable  unwinding  is  also  possible: 

typeof ints_with_varied_unwinding = list N;
ints_with_varied_unwinding = {:# i unwind i || i <- 1 to 10};

%= 1 :# 2 : 3 :# 4 : 5 : 6 : 7 :# 8 : 9 : 10 : nil

Exactly which binding environment the unwinding expression is evaluated in is very important,
as the value of the “unwind expression” can change. The “unwind expression” is evaluated
in the first binding environment, and reevaluated each time the stream suspends and resumes.
In other words, the “unwind expression” is evaluated once for each “forced tail”.
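
The evaluation rule for the unwind expression can be checked with a small model. The sketch
below is written in Python purely for illustration; unwind_chunks is a hypothetical helper, and
the clamp to at least one element per force is an assumption of the model. It groups binding
environments into the chunks produced by successive forced tails:

```python
def unwind_chunks(envs, unwind):
    """Group binding environments into eagerly produced chunks.

    `unwind` is applied to the first environment after each resumption,
    i.e. once per forced tail, and never again within that chunk.
    """
    envs = list(envs)
    chunks, pos = [], 0
    while pos < len(envs):
        count = max(1, unwind(envs[pos]))  # assumed: at least one element
        chunks.append(envs[pos:pos + count])
        pos += count
    return chunks


chunks = unwind_chunks(range(1, 11), lambda i: i)
print(chunks)  # [[1], [2, 3], [4, 5, 6, 7], [8, 9, 10]]
```

The chunk boundaries fall after 1, 3, and 7, where the trace above shows the stream suspending;
the last chunk is cut short because the generator runs out of binding environments.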

A  new  type  of  unwinding  is  also  possible.  The  while  and  until  keywords  allow  the  stream 
to  unwind  while  or  until  a  predicate  is  satisfied.  The  following  stream  denotes  the  integers  from 
one  to  ten,  and  unwinds  until  it  finds  a  prime  element: 

typeof prime_unwinder = list N;

prime_unwinder = {:# i until prime? i || i <- 1 to 10};

%= 1 : 2 :# 3 :# 4 : 5 :# 6 : 7 :# 8 : 9 : 10 : nil

The expression associated with while or until is evaluated in every binding environment.
The predicate acts as a post-test, i.e., it specifies whether the next element should be produced
eagerly. At least one stream element is produced, regardless of the unwinding controls, unless
the generators run out of binding environments.

Numerical  unwinding  (using  the  unwind  keyword)  and  boolean  unwinding  (using  the  while 
and  until  keywords)  can  be  combined,  in  which  case  the  stream  suspends  if  either  the  unwind 
count  dips  below  one,  or  the  boolean  test  indicates  that  a  suspension  is  in  order. 
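
The interaction of the two kinds of unwinding can be modeled as follows. This is an illustrative
Python sketch under our reading of the rules above, not the Id# implementation; stream_chunks
and is_prime are hypothetical names:

```python
def is_prime(n):
    """Hypothetical primality test standing in for `prime?`."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def stream_chunks(envs, unwind=None, until=None):
    """Group environments into chunks, one chunk per forced tail.

    unwind: evaluated once, in the first environment after a resumption.
    until:  post-test evaluated in every environment; True suspends.
    The stream suspends as soon as either control asks for it.
    """
    envs = list(envs)
    chunks, pos = [], 0
    while pos < len(envs):
        budget = unwind(envs[pos]) if unwind is not None else None
        chunk = []
        while pos < len(envs):
            env = envs[pos]
            chunk.append(env)       # at least one element per force
            pos += 1
            if budget is not None:
                budget -= 1
                if budget < 1:      # unwind count dipped below one
                    break
            if until is not None and until(env):
                break               # boolean test asks for a suspension
        chunks.append(chunk)
    return chunks


print(stream_chunks(range(1, 11), until=is_prime))
# [[1, 2], [3], [4, 5], [6, 7], [8, 9, 10]]
print(stream_chunks(range(1, 11), unwind=lambda i: 2, until=is_prime))
# [[1, 2], [3], [4, 5], [6, 7], [8, 9], [10]]
```

The first call gives the same chunks as the prime_unwinder trace. In the second, the numeric limit
of two additionally splits the final run 8, 9, 10.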

A  list  comprehension  can  be  viewed  as  a  stream  comprehension  with  infinite  unwinding. 
Conversely,  a  stream  comprehension  can  end  up  producing  a  data  structure  of  finite  length,  as 
does  a  list  comprehension. 

The  most  interesting  new  opportunity  still  remains:  a  stream  comprehension  can  have 
infinite  generators.  Consider  the  following  stream  comprehension  for  the  squares  of  the  integers, 
which  has  an  infinite  arithmetic  sequence  as  a  generator: 

typeof  squares  =  list  N; 

squares = {:# i^2 || i <- upfrom 1};

%= 1 :# 4 :# 9 :# ...

We could enumerate the first octant (the grid points with integer coordinates in the first
quadrant, above and including the x-axis and below the 45° line) as follows:

typeof octant1 = list (N,N);

octant1 = {:# (x,y) || x <- upfrom 1 & y <- 0 to x-1};

%= (1,0) :# (2,0) :# (2,1) :# ...

We might try naively to enumerate the first quadrant similarly:

typeof quadrant1 = list (N,N);

quadrant1 = {:# (x,y) || x <- upfrom 1 & y <- upfrom 0};

%= ???

Diagonalization (i.e., fairly producing the cross-product) is not automatic. The stream
quadrant1 will climb the grid along the vertical line x = 1, and never enumerate any other
points.

quadrant1 = (1,0) :# (1,1) :# (1,2) :# (1,3) :# ...

In contrast, Miranda provides an idiom for automatic diagonalization. In Miranda, brackets
(“[” and “]”) set off lists, and two dots (“..”) indicate an integer range.

quadrant1 = [(x,y) // x<-[1..]; y<-[0..]]

%= [(1,0),(1,1),(2,0),(1,2),(2,1),(3,0), ... ]

(Miranda)
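
The difference between the two enumerations can be made concrete with Python generators. The
sketch below is only an illustration: diagonal_quadrant is one fair enumeration that happens to
reproduce the order shown above, and it is not a description of how Miranda implements
diagonalization.

```python
from itertools import count, islice


def naive_quadrant():
    """Nested infinite generators: the inner one never finishes."""
    for x in count(1):
        for y in count(0):
            yield (x, y)


def diagonal_quadrant():
    """Visit the points one anti-diagonal at a time, so every point
    is reached after finitely many steps."""
    for d in count(0):
        for x in range(1, d + 2):
            yield (x, d - (x - 1))


print(list(islice(naive_quadrant(), 4)))
# [(1, 0), (1, 1), (1, 2), (1, 3)]
print(list(islice(diagonal_quadrant(), 6)))
# [(1, 0), (1, 1), (2, 0), (1, 2), (2, 1), (3, 0)]
```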

In  the  following  example,  the  unwind  expression  is  variable: 

typeof octant1_a_column_at_a_time = list (N,N);
octant1_a_column_at_a_time =

{:# (x,y) unwind x || x <- upfrom 1 & y <- 0 to x};

Each  time  we  consume  a  vertical  column  of  the  triangle,  the  next  column  is  computed 
eagerly.  The  unwinding  facility  is  thus  a  tool  for  speculative  computation. 

Two more variants of list comprehension are of interest: head-lazy list comprehension
and head-lazy stream comprehension, corresponding to the following analogy:

{: : {#: :: {:# : {#:#

Put another way, head-lazy list comprehension is the head-lazy version of
list comprehension, and head-lazy stream comprehension is the head-lazy version of stream
comprehension. Head-lazy lists are built with cons cells with the head being evaluated lazily
and the tail being evaluated eagerly. Head-lazy streams are built with cons cells with both the head
and tail being evaluated lazily.

A head-lazy list is a list where the spine (skeleton) of the list is expanded eagerly, but
the elements are only computed on demand. The head-lazy list comprehension, which is like the
list comprehension with the initial “{:” replaced by “{#:”, is a convenient way to express
head-lazy lists. In the following code, the spine is expanded immediately, but the function f is
not applied to any elements until they are requested by a consumer:

{#: f x || x <- 1 to 10}

%= f 1 #: f 2 #: ... #: f 10 #: nil


Suppose  the  function  f  is  very  expensive  to  compute;  the  list  produced  is  short;  we  only  need 
some  list  elements;  but  we  cannot  tell  in  advance  which  elements  are  needed.  The  head-lazy 
list  comprehension  is  designed  for  this  scenario,  the  “expensive  slots”  programming  paradigm. 
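
The “expensive slots” paradigm can be imitated in a language without lazy slots by storing
explicit, memoized delayed values. The following Python sketch is an analogy only; the class
Delayed and the function f are hypothetical:

```python
class Delayed:
    """A delayed head: computed on first demand, then remembered."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
            self._compute = None  # the delayed expression is no longer needed
        return self._value


calls = []


def f(x):
    """Hypothetical expensive function; records each application."""
    calls.append(x)
    return x * x


# The spine (a ten-element list) is built immediately;
# no application of f has happened yet.
head_lazy = [Delayed(lambda x=x: f(x)) for x in range(1, 11)]

result = head_lazy[3].force()  # a consumer requests the fourth element
again = head_lazy[3].force()   # a second request reuses the stored value
print(result, again, calls)    # 16 16 [4]
```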

A head-lazy stream is one in which both the elements and the spine are expanded on demand.
The head-lazy stream comprehension, which is like the stream comprehension with the initial
“{:#” replaced by “{#:#”, is a convenient way to express head-lazy streams. In the following
code, the head-lazy stream is lazy in the head and tail. The function f is not applied to any
elements that are not needed, and the tail expands on demand.

{#:# f x || x <- upfrom 1}

%= f 1 #:# f 2 #:# f 3 #:# ...

Suppose  the  function  f  is  very  expensive  to  compute;  we  need  only  a  fraction  of  some  finite 
prefix  of  the  elements;  but  we  cannot  tell  in  advance  which  elements  are  needed.  The  head-lazy 
stream  comprehension  is  designed  for  this  scenario,  which  combines  the  “expensive  slots”  and 
“infinite  structures”  programming  paradigms. 

The  properties  of  lists  (streams)  produced  by  any  of  these  varieties  of  list  comprehension 
are  determined  by  the  type  of  the  constructor,  not  by  the  type  of  an  embedded  generator.  A 
list  (as  opposed  to  a  stream)  with  an  infinite  generator  will  diverge,  as  we  see  in  the  following 
example: 

typeof  diverge  =  list  N; 
diverge = {: x || x <- upfrom 1};

%= 1 : 2 : 3 : ...

All  the  facilities  described  in  this  section  can  be  mixed  and  matched.  As  comprehension 
is  syntactic  sugar,  all  combinations  can  be  expressed  directly  in  Id.  This  is  most  easily  done 
in  terms  of  recursive  procedures,  but,  as  we  indicated  at  the  end  of  the  previous  section,  more 
efficient  techniques  are  available.  Streams  with  unwinding  as  well  as  streams  with  filters  (when 
and  unless)  are  prime  candidates  for  loop-style  implementations.  Unwinding  can  proceed 
eagerly  until  it  is  time  to  suspend.  If  an  environment  is  discarded  by  a  filter,  the  next  one  can 
be generated eagerly. A range of optimizations is possible, and a few are described in the
remainder of this section.

If a generator is an arithmetic sequence, no intermediate stream need be generated. The
state of the environment can be passed directly from one recursion (iteration) to the next. This
is especially important when generating cross products, where many intermediate streams can be
avoided.

Lists  that  are  passed  in  (a  <-  as)  and  finite  generators  require  a  test  for  the  end.  Some 
generators,  however,  are  known  to  be  infinite,  and  an  end  test  can  be  avoided. 

Arithmetic sequences and list comprehension provide support mainly for the “programming
with infinite data-structures” paradigm. The next two sections, on arrays and
I-structures, address the “programming with expensive slots” paradigm.

2.2.4  Arrays 

In Id#, an array is a functional, indexed data structure. Each slot of an array may be assigned
at most once, and arrays can be constructed with lazy slots.

Arrays are produced using array comprehension. An array comprehension declares the
bounds of the array and has clauses that specify the elements for regions of the array. Both
the index expression and the actual expression are evaluated in the specified sequence of
binding environments. In the following array comprehension, the array is filled with integers in
descending order:

typeof a1 = array N;

a1 = {array (1,n) | [i] = n-i+1 || i <- 1 to n}

The  index  expression  can  also  be  non-trivial.  The  following  array  comprehension  produces 
an  array  identical  to  the  preceding  one: 

typeof  a2  =  array  N; 

a2 = {array (1,n) | [n-i+1] = i || i <- 1 to n}

Several clauses can be given to fill several regions. The procedure identity_matrix uses
array comprehension to produce the identity matrix of size n by n:

typeof identity_matrix = N -> (I_matrix N);

def identity_matrix n =
  {matrix ((1,n),(1,n))
   | [i,j] = 0 || i <- 1 to n-1 & j <- i+1 to n    % above the diagonal
   | [i,i] = 1 || i <- 1 to n                      % the diagonal
   | [i,j] = 0 || i <- 2 to n & j <- 1 to i-1};    % below the diagonal

To have a lazy slot, “#” replaces “=”, as follows:

typeof edge = N -> N;
typeof middle = N -> N;
typeof an_array = N -> (array N);
def an_array n =
  {array (1,n)
   | [1] = edge 1
   | [i] # middle i || i <- 2 to n-1
   | [n] # edge n };

Slot  1  is  assigned  eagerly,  and  the  rest  of  the  slots  are  computed  only  after  being  selected 
by  an  array  selection  operation. 

typeof consume_array = N -> N -> N;
def consume_array n i =
  { a = an_array n
  in
    a[i]};
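
The behavior of an_array and consume_array can be approximated in Python with explicit delayed
values. This is a sketch under stated assumptions: the bodies of edge and middle are invented
for the example, the array is modeled as a dictionary, and a slot is treated as delayed exactly
when it holds a callable (which would not work for arrays of functions). We assume n >= 2.

```python
calls = []


def edge(i):
    """Hypothetical stand-in for `edge`; records each evaluation."""
    calls.append(("edge", i))
    return 100 + i


def middle(i):
    """Hypothetical stand-in for `middle`; records each evaluation."""
    calls.append(("middle", i))
    return 2 * i


def an_array(n):
    slots = {1: edge(1)}                  # slot 1 is assigned eagerly
    for i in range(2, n):
        slots[i] = lambda i=i: middle(i)  # delayed, like `[i] # middle i`
    slots[n] = lambda: edge(n)            # delayed, like `[n] # edge n`
    return slots


def select(a, i):
    """Array selection: evaluate a delayed slot once and keep the value."""
    if callable(a[i]):
        a[i] = a[i]()
    return a[i]


def consume_array(n, i):
    return select(an_array(n), i)


value = consume_array(5, 3)
print(value, calls)  # 6 [('edge', 1), ('middle', 3)]
```

Only the eager slot and the selected slot are ever evaluated; the delayed expressions for slots
2, 4, and 5 are discarded unevaluated.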

2.2.5  I-structures 

I-structures, as we have noted, are non-functional data structures. They are indexed and may
have distributed definitions. Each slot of an I-structure may be assigned at most once.
I-structure slots behave something like Prolog’s “logic variables” [4]. I-structures, like all data
structures in Id, can be constructed with lazy slots.

I-structures can be created in one place and filled in any number of places, very much like
Fortran arrays. I-structure slots are assigned lazily by, once again, replacing the “=” by “#”.
The procedure an_I_structure has a for loop that is run purely for side effect*. Note how the
loop index i is generated from an arithmetic sequence in the “comprehension style”.

typeof an_I_structure = N -> (I_array N);
def an_I_structure n =
  {a = I_array (1,n);
   a[1] = edge 1;
   {for i <- 2 to n-3 do a[i] # middle i};
   a[n] # edge n
  in
   a};

*Id# loops can also return a result.

Slot 1 is assigned eagerly, slots 2 through n-3 and the last slot are assigned lazily, and slots
n-2 and n-1 are left unassigned. The contents of the delayed slots are computed only after
being selected by an I-structure selection operation. In the following code, slots n-2 and n-1
are filled in, the former eagerly and the latter lazily. Then a slot is selected, which may or
may not cause a delayed expression to be evaluated, depending on whether the selected
slot was assigned lazily or eagerly. The let expression in consume_I_structure contains one
binding and two I-structure assignments.

typeof consume_I_structure = N -> N -> N;
def consume_I_structure n i =
  { a = an_I_structure n;
    a[n-2] = middle (n-2);
    a[n-1] # middle (n-1)
  in
    a[i]};


2.3  Operational  Semantics 

A small extension to the rewrite rules found in [9] allows us to capture the operational semantics
of our new constructs. Rewrite rules are described in a two-column format. The left column
holds an expression and its binding environment (listed below the expression), and
the right column holds the rewritten expression along with its bindings. The expression may
change, as may the bindings, and new bindings may be introduced.

expression                          =>  expression'

binding1 ; ... ; bindingn               binding1 ; ... ; bindingm

The rewrite rules given in [9] are context sensitive. For example, a sub-expression may
not be evaluated within the Then or Else arms of a conditional expression. The rewrite
rules may not look inside certain expressions, and we make this behavior explicit by enclosing
any opaque expressions in double quotes. When a conditional is first written down, the
consequent and alternate clauses are surrounded by quotes and are opaque. After the boolean
value is determined, the conditional is contracted, the quotes are removed, and the rules are
allowed to perform reductions within the previously protected expression. A conditional is
rewritten as follows:

(If true Then "E1" Else "E2")       =>  (E1)

B1 ; ... ; Bn                           B1 ; ... ; Bn

We only need one rewrite rule to support all of our constructs. First, we note that all
expressions preceded by a “#” start off as opaque (enclosed in double quotes). Whenever an
identifier is selected as a redex, if it is bound to an opaque expression the quotes are removed.
An identifier may not be taken as a redex if it is within an opaque expression, or in a data
structure.

X                                   =>  X

B1 ; ... ; Bn ; X = "E1"                B1 ; ... ; Bn ; X = E1

This  rule  applies  to  conses,  tuples,  algebraic  types,  arrays,  and  I-structures. 

Consider  the  following  definition: 

typeof ints_from = N -> (list N);

def ints_from n = n : #ints_from (n+1);

As an example rewrite sequence, consider the evaluation of an expression. We begin with
a query for the second element of a stream and an empty environment. Expressions that are
underlined are about to be rewritten.

hd (tl (ints_from 1))

<no bindings>

The application of ints_from is expanded, introducing formal parameter n1. A binding for
n1 appears in the binding list. Note the opaque expression: we substitute for the exposed n1,
but the n1 buried in the opaque expression may not be rewritten.

=>

hd (tl (n1 : "ints_from (n1+1)"))

n1 = 1

=>

hd (tl (1 : "ints_from (n1+1)"))

n1 = 1

Apply the cons, introducing new names for the parts.

=>

hd (tl <cons x001 x002>)

n1 = 1; x001 = 1; x002 = "ints_from (n1+1)"

Apply the tail function.

=>

hd x002

n1 = 1; x001 = 1; x002 = "ints_from (n1+1)"

Use the new rule to expose the opaque expression.

=>

hd x002

n1 = 1; x001 = 1; x002 = ints_from (n1+1)

Apply ints_from again, introducing formal parameter n2 and another opaque expression.

=>

hd x002

n1 = 1; x001 = 1; x002 = n2 : "ints_from (n2+1)"; n2 = n1+1

=>

hd x002

n1 = 1; x001 = 1; x002 = n2 : "ints_from (n2+1)"; n2 = 1+1

=>

hd x002

n1 = 1; x001 = 1; x002 = n2 : "ints_from (n2+1)"; n2 = 2

Substitute for the exposed n2. As before, the n2 buried in the opaque expression may not
be rewritten.

=>

hd x002

n1 = 1; x001 = 1; x002 = 2 : "ints_from (n2+1)"; n2 = 2

Applying another cons operator, we introduce two more new names.

=>

hd x002

n1 = 1; x001 = 1; x002 = <cons x003 x004>; n2 = 2;
x003 = 2; x004 = "ints_from (n2+1)"

Look up x002 in the environment.

=>

hd <cons x003 x004>

n1 = 1; x001 = 1; x002 = <cons x003 x004>; n2 = 2;
x003 = 2; x004 = "ints_from (n2+1)"

Apply the head function.

=>

x003

n1 = 1; x001 = 1; x002 = <cons x003 x004>; n2 = 2;
x003 = 2; x004 = "ints_from (n2+1)"

Finally, look up x003 in the environment.

=>

2

n1 = 1; x001 = 1; x002 = <cons x003 x004>; n2 = 2;
x003 = 2; x004 = "ints_from (n2+1)"

Other reduction orders are possible for the above example, but they differ from the one
shown in no essential way.

Removing the quotes and exposing an opaque expression is analogous to enabling the
evaluation of a delayed expression. According to our new rewrite rule, a delayed expression is
evaluated at most once, as soon as its value is needed.
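
The “at most once” property can be demonstrated with a small model of opaque expressions. The
Python sketch below is an analogy to the rewrite rule rather than an implementation of it; the
class Opaque and the evaluation log are hypothetical devices for the demonstration.

```python
class Opaque:
    """A delayed expression: it stays "quoted" until its value is demanded."""

    def __init__(self, expr):
        self._expr = expr
        self._value = None
        self.exposed = False

    def expose(self):
        if not self.exposed:          # remove the quotes at most once
            self._value = self._expr()
            self.exposed = True
        return self._value


evaluations = []  # records each time a delayed tail is actually evaluated


def ints_from(n):
    def tail():
        evaluations.append(n + 1)
        return ints_from(n + 1)
    return (n, Opaque(tail))          # models  n : #ints_from (n+1)


def hd(cell):
    return cell[0]


def tl(cell):
    return cell[1].expose()


s = ints_from(1)
second = hd(tl(s))
second_again = hd(tl(s))                  # the tail is not re-evaluated
print(second, second_again, evaluations)  # 2 2 [2]
```

As in the rewrite sequence, the tail of the second cell is still opaque when the answer 2 is
produced.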

Arithmetic sequences and stream comprehension are syntactic sugar and are thus defined
within the language. They need no special treatment here.

2.4  Abstract  Costs  and  Benefits 

There  are  several  abstract  costs  to  our  approach.  The  first  is  expressive  clarity.  What  are 
the  repercussions  of  sprinkling  our  programs  with  hash  marks?  Can  annotated  programs  be 
understood  clearly  and  intuitively?  Next  is  expressive  power.  Can  we  code  all  the  programs  we 
would  like  in  a  straightforward  manner?  This  is  an  obvious  question,  since  we  cannot  express  all 
the  programs  that  can  be  expressed  in  a  lazy  functional  language.  We  consider  these  potential 
problems  in  detail  in  Chapter  4. 

Are  there  advantages  to  explicit  annotations?  Laziness  is  expensive,  and  annotations  prevent 
us  from  ignoring  the  expense  by  sweeping  it  under  the  rug. 

The  first  and  foremost  advantage  of  our  approach  is  the  fact  that  it  adds  a  power  to  the 
language  that  it  did  not  formerly  possess.  We  can  now  express  infinite  data-structures  and  data 
structures  with  expensive  slots  directly. 

2.5  Summary 

This chapter extends Id to incorporate lazy data-structures. When constructing any data
structure in Id#, the expression destined for any slot may be annotated with a “#” to indicate
that the evaluation of the expression should be delayed until a consumer requests the value of the
slot. Using lazy data-structures, we can write programs using both the infinite data-structures
programming paradigm and the data structures with expensive slots programming paradigm.

Arithmetic sequences and stream comprehension provide special syntax for stream programming.
Explicit control over unwinding allows the programmer to express speculative computation.

Chapter 3

Implementation  of  Id# 


In order to implement the language extensions described in the previous chapter, support
is required in both the compiler and the architecture. In this chapter we describe our approach
to implementation, support from the architecture, compilation techniques, and the concrete costs
and benefits of our approach.


3.1  Approach

The compiler generates dataflow graphs to build thunks which embed the delayed expressions,
and the architecture provides synchronization for triggering the delayed expressions and reading
the results. We have already presented the language extensions, i.e., the system at the
highest level. Now we present the support structure from the bottom up. First, we present the
architectural extensions that provide the necessary hardware support. Then, we present the
compiler extensions needed to reduce our language to architectural primitives.

3.2  Architectural  Extensions 

Several architectural extensions are needed to support lazy data-structures. First, we extend
I-structures* [10, 23] to L-structures (lazy structures), which provide the synchronization
required to support lazy data-structures. Next, we provide support for suicide procs, procedures
embedding delayed expressions that are invoked by the memory system rather than by an
explicit procedure call. Finally, we introduce a new manager (dataflow run-time system call)
to invoke suicide procs.

*Unless otherwise noted, in this chapter the term “I-structure” refers to an implementation
mechanism, not a language construct.

3.2.1  L-structures


Tuples, algebraic types, I-structures (the language “I-structure”), and arrays will all be
implemented using L-structures. An L-structure is like an array in any programming language, but
each slot has synchronization built into it. L-structures are a variation on I-structures [9]. We
begin our development of L-structures by first presenting I-structures.

Each  slot  of  an  I-structure  has  status  bits  associated  with  it.  If  a  consumer  arrives  before 
the  producer,  the  fetch  is  remembered  locally  by  the  I-structure  slot  in  a  list  until  the  value  is 
stored.  The  state  of  the  slot  is  recorded  by  the  status  bits.  Figure  3.1  is  the  state  transition 
diagram  for  an  individual  I-structure  slot  (as  opposed  to  the  entire  I-structure).  Activity  caused 
by  a  transition  is  indicated  after  the  slash. 


Figure  3.1:  State  Transition  Diagram  for  an  I-structure  Slot 

There are three operations on I-structures, as we have seen in Figure 3.1. I-array takes an
integer argument and returns an empty I-structure of corresponding size; store places data in
an empty slot, satisfying any deferred fetches; and fetch returns the data if it is present and
otherwise registers a deferred fetch. These operations are summarized in Table 3.1.

I-array  size                                 =>  <I-structure-address>
store    <I-structure-address>  value         =>  <acknowledgment>
fetch    <I-structure-address>  destination   =>  value


Table 3.1: I-structure Operations

Two  paths  through  the  state  transition  diagram  are  possible.  Both  paths  are  demonstrated 
in  parallel  in  the  following  simulation: 

Simulation Step 1: Allocate x, an I-structure of size two. Both slots start in the empty
state. The state of each slot appears below the data.

x = I-array 2

Figure  3.2:  Simulation  Step  1:  Allocate  an  I-structure  of  Size  Two 

Simulation Step 2: Simultaneously store the number 5 in the first slot x[0] (I-structures
are zero-indexed) and fetch the contents of x[1] for reader r1. x[0] enters the present state
as it now holds valid data, and x[1] enters the deferred state and contains a pointer to the
deferred fetch list containing r1 (the slash in the right half of the deferred fetch list indicates
the end of the list).

store 5 in x[0]
fetch x[1] for r1

Figure  3.3:  Simulation  Step  2:  store  in  x[0],  fetch  from  x[l] 

Simulation Step 3: fetch the contents of x[0] for reader r2 and fetch the contents of
x[1] for r3. Since x[0] has valid data, r2 is sent the stored value 5. Since x[1] has already
been deferred (no data), r3 is pushed onto the deferred fetch list for x[1].

fetch x[0] for r2
fetch x[1] for r3

Figure  3.4:  Simulation  Step  3:  fetch  from  x[0],  fetch  from  x[l] 

Simulation Step 4: store the value 10 in x[1]. x[1] makes the transition to the present
state as the deferred fetches are satisfied by sending the value to r3 and r1.

<send 10 to r3>
<send 10 to r1>

Figure 3.5: Simulation Step 4: store in x[1]

The two-slot I-structure we have been modeling could have been a 2-tuple, a cons cell, an
I-structure of size two (the language “I-structure”), or a variety of other data structures. All
data structures look the same at this level: in Id, all data structures are implemented using
I-structures, which provide the low-level synchronization needed to support producer/consumer
synchronization in a multiprocessor.
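
The four simulation steps can be replayed in software. The following Python sketch models a single
I-structure slot as a small state machine; it is an illustration only, since real I-structure
memory is a hardware mechanism. The class ISlot, the deliver callback, and the error raised on a
second store are assumptions of the sketch.

```python
class ISlot:
    """One I-structure slot with states empty, deferred, and present."""

    def __init__(self):
        self.state = "empty"
        self.value = None
        self.deferred = []  # deferred fetch list, most recent reader first

    def fetch(self, reader, deliver):
        if self.state == "present":
            deliver(reader, self.value)      # data is there: answer at once
        else:
            self.deferred.insert(0, reader)  # push onto the deferred list
            self.state = "deferred"

    def store(self, value, deliver):
        if self.state == "present":
            raise RuntimeError("slot already written")  # assumed behavior
        self.value, self.state = value, "present"
        pending, self.deferred = self.deferred, []
        for reader in pending:               # satisfy the deferred fetches
            deliver(reader, value)


log = []


def deliver(reader, value):
    log.append((reader, value))


x = [ISlot(), ISlot()]        # Step 1: x = I-array 2
x[0].store(5, deliver)        # Step 2: store 5 in x[0]
x[1].fetch("r1", deliver)     #         fetch x[1] for r1
x[0].fetch("r2", deliver)     # Step 3: fetch x[0] for r2
x[1].fetch("r3", deliver)     #         fetch x[1] for r3
x[1].store(10, deliver)       # Step 4: store 10 in x[1]
print(log)  # [('r2', 5), ('r3', 10), ('r1', 10)]
```

Because r3 was pushed onto the front of the deferred fetch list, it is served before r1, matching
the order of the sends in Step 4.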

Now we introduce L-structures to provide a substrate for implementing lazy data-structures.
L-structures provide support for producer/consumer synchronization and, additionally, for
explicitly delaying and implicitly evaluating the contents of individual slots.

I-structures synchronize readers and writers; readers are stalled until the value is written.
L-structures also synchronize readers and writers, and, in addition, they synchronize the
evaluation of delayed expressions. Evaluation is stalled until there is a reader (a request for the
value), and readers are stalled until the value arrives.

L-structures have four memory operations: L-array, store-thunk, store-data, and fetch.
L-array takes an integer argument and returns an empty L-structure of corresponding size;
store-thunk places a thunk in an empty slot; fetch ejects any thunk that is present, causing
the thunk to be evaluated, and registers a deferred fetch; and store-data places data in a slot,
satisfying any deferred fetches. L-structure operations are summarized in Table 3.2.

L-array      size                                 =>  <L-structure-address>
store-thunk  <L-structure-address>  thunk         =>  <acknowledgment>
store-data   <L-structure-address>  value         =>  <acknowledgment>
fetch        <L-structure-address>  destination   =>  value

Table 3.2: L-structure Operations

When  used  to  delay  computation,  an  L-structure  slot  is  “written  twice”:  first  with  the 
thunk,  then  with  the  value  the  thunk  computes.  We  add  the  delayed  and  evaluating  states 
to  the  I-structure  state  transition  diagram  to  achieve  lazy  behavior,  as  shown  in  Figure  3.6. 
Figure  3.6  is  the  state  transition  diagram  for  each  slot  of  an  L-structure. 

Five “memory snapshots” of a single slot of an L-structure corresponding to the five states
of Figure 3.6 are arranged in Figure 3.7. The state of the slot appears in the lower box, and the
data can be found in the upper box. The slot starts in the empty state (on the left). Suppose a
store-thunk  operation  is  the  first  to  arrive.  The  thunk  (delayed  expression  and  environment) 
is  stored  in  the  slot  as  shown  at  the  top.  Note  the  pointer  from  the  thunk  back  to  the  delayed 
slot.  When  a  fetch  operation  arrives,  the  delayed  expression  is  spawned  (by  magic,  for  now). 
The  star-burst  at  the  lower  right  represents  the  spawned  process.  The  fetch  is  put  on  a  deferred 
fetch list, along with any other fetches that arrive. r1 and r2 represent readers of the slot that
have been deferred. Eventually, the spawned computation will finish evaluating the delayed
expression  and  issue  a  store-data  for  this  slot.  The  value  of  the  previously  delayed  expression 
is  stored,  and  deferred  fetches  are  satisfied. 

Suppose instead that a fetch reaches the slot first. The slot starts in the empty
state, moves to the deferred state, where more fetches may be queued, then (once the store-thunk
arrives) to the evaluating state, where the delayed expression is spawned, and finally down to the
present state, where the value of the previously delayed expression sits, awaiting other readers.

L-structure operations are a superset of (and consistent with) I-structure operations.
Correspondingly, the L-structure transition diagram embeds the I-structure transition diagram, and
as a result, I-structure behavior can also be achieved in an L-structure.

Figure 3.7: State Transition Memory Snapshots

There are two remaining paths through the snapshots which correspond to I-structure
behavior (no delayed computation, just producer/consumer synchronization). If a fetch arrives
before the store-data, the status of the slot moves from the empty state to the deferred state,
where more readers may be deferred, and down to the present state when the data arrives on
a store-data operation. If the store-data precedes any fetches, the status of the slot simply
moves from the empty state to the present state, where the data waits for any readers.
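The transitions described above can be sketched as a small sequential simulation. The Python below is only an illustration, not the memory controller itself: the names Slot, State, and run_spawned are hypothetical, a thunk is modelled as a zero-argument function, readers are modelled as callbacks, and raising an error on a second write is an assumption about how a multiple definition would be reported.

```python
from enum import Enum


class State(Enum):
    EMPTY = "empty"
    DELAYED = "delayed"
    DEFERRED = "deferred"
    EVALUATING = "evaluating"
    PRESENT = "present"


class Slot:
    """One L-structure slot, simulated sequentially (no real parallelism)."""

    def __init__(self):
        self.state = State.EMPTY
        self.thunk = None     # zero-argument callable standing in for a thunk
        self.readers = []     # deferred fetch list (callbacks awaiting the value)
        self.value = None
        self.spawned = []     # thunks handed off for evaluation

    def store_thunk(self, thunk):
        if self.state is State.EMPTY:
            self.thunk, self.state = thunk, State.DELAYED
        elif self.state is State.DEFERRED:
            # Readers are already waiting, so evaluation starts at once.
            self._spawn(thunk)
        else:
            raise RuntimeError("slot already written")

    def fetch(self, reader):
        if self.state is State.PRESENT:
            reader(self.value)
            return
        self.readers.append(reader)
        if self.state is State.EMPTY:
            self.state = State.DEFERRED
        elif self.state is State.DELAYED:
            thunk, self.thunk = self.thunk, None
            self._spawn(thunk)
        # DEFERRED and EVALUATING: the fetch just joins the deferred list.

    def store_data(self, value):
        if self.state in (State.DELAYED, State.PRESENT):
            raise RuntimeError("slot already written")
        self.value, self.state = value, State.PRESENT
        readers, self.readers = self.readers, []
        for reader in readers:
            reader(value)

    def _spawn(self, thunk):
        self.state = State.EVALUATING
        self.spawned.append(thunk)

    def run_spawned(self):
        """Stand-in for the spawned process: evaluate, then issue store-data."""
        while self.spawned:
            self.store_data(self.spawned.pop()())
```

Storing a thunk, fetching twice, and then calling run_spawned walks a slot through the delayed, evaluating, and present states; a fetch followed by store_data follows the plain I-structure path through the deferred state.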

In Id#, tuples, algebraic types, arrays, and I-structures (the language “I-structure”) can all
be implemented using L-structures.

Minimal changes are needed in the TTDA’s I-Structure Memory Controller [23] to
accomplish the desired new behavior on the TTDA. Also, this behavior is simple to achieve on
Monsoon [38].

3.2.2  Thunks 

A thunk embeds the delayed expression and its environment. The delayed expression is
embedded (at compile time) in a dataflow graph that fetches its free variables from the environment
and stores the result in the slot containing the thunk. This dataflow graph (along with the
dataflow graph to compute the delayed expression itself) is known as a suicide proc. When
a slot containing a thunk is read, the memory system sends the thunk to a thunk invocation
manager, which invokes the suicide proc.

The detailed structure of a thunk is presented in Section 3.3.1.

3.2.3  Suicide  Procs 

Normally,  parent  and  child  procedures  must  communicate  to  pass  arguments  and  to  return 
results. Delayed expressions, however, are passed their arguments implicitly, and store the
result  themselves.  Thus,  the  processing  of  a  delayed  expression  needs  no  direct  linkage  with 
either  the  process  that  created  it  or  the  processes  that  are  consuming  its  result.  A  new  procedure 
call/return  mechanism  to  support  this  behavior  is  developed.  Procedures  that  are  invoked  using 
this  new  mechanism  are  called  suicide  procs. 

A  suicide  proc  is  invoked  by  the  thunk  invocation  manager,  not  by  an  explicit  procedure 
call.  A  potential  problem,  however,  is  that  a  thunk  generates  an  independent  thread,  i.e.,  a 
procedure  without  a  parent.  This  may  affect  the  policy  for  resource  allocation.  Currently, 
running  a  procedure  unfolds  into  a  single  execution  tree.  Once  we  start  spawning  processes, 
there  will  be  an  execution  forest,  and  trees  will  have  independent  lifetimes.  It  is  likely,  however, 
that  the  run-time  system  will  deal  with  this  complexity  for  other  reasons.  For  example,  the 
storage manager and other run-time systems will run independently.

The detailed structure of a suicide proc is presented in Section 3.3.2.


3.2.4  The  Thunk  Invocation  Manager 

In  a  dataflow  system,  a  globally  defined  set  of  routines  known  as  managers  form  the  run-time 
system  [8].  We  need  a  new  manager  for  spawning  suicide  procs,  the  thunk  invocation  manager, 
similar  to  the  manager  for  invoking  procedures.  In  our  system,  this  manager  would  work  as 
follows. 

The thunk invocation manager is passed a thunk. The thunk contains a pointer to a suicide
proc, which the thunk invocation manager must fetch. The thunk invocation manager
must also acquire a new invocation frame for the suicide proc, and send the thunk to the suicide
proc. The suicide proc can then read the free variables from the thunk, compute the expression,
and store the result back into the delayed slot. Since the suicide proc will deallocate its own
invocation frame, no thunk termination manager is needed.


3.3  Compilation  Techniques 

This section concentrates on the details of compiling lazy data-structures. The detailed
structure of a thunk is presented as well as the dataflow graph required to build it. The structure
of a suicide proc, the dataflow graph embedding the delayed expression, is also presented.

3.3.1  Thunks 

In  this  section,  we  make  the  following  conventions: 

•  e  is  the  delayed  expression 

•  V1, ..., Vn are the free variables of e

•  Se  is  the  suicide  proc  that  embeds  e 

•  result-address  is  a  pointer  to  the  delayed  slot 

A thunk contains a pointer to its suicide proc (Se, which embeds delayed expression e), the
result-address, and the values of all the free variables (V1, ..., Vn) of the delayed expression,
as shown in Figure 3.8.

At this point we are ready to look at our first dataflow graph. In our figures, every dataflow
graph  node  is  labeled  with  the  type  of  operation  executed  by  the  node  (in  the  larger  box) 
and  an  instruction  number  (in  the  smaller  box).  Immediate  constants  associated  with  nodes 
are  written  in  irregular  pentagons  above  the  nodes.  Data  paths  correspond  to  arcs;  inputs 
correspond  to  arcs  with  dangling  tails;  and  outputs  correspond  to  arcs  with  dangling  heads. 

The dataflow graph to build a thunk is shown in Figure 3.9. The nodes are numbered z to
z + 2n + 5; a total of 2n + 6 operations are required to construct a thunk corresponding to a
delayed expression that has n free variables. The numbering does not start at zero to remind
us that this dataflow graph is part of a larger dataflow graph.

Node z allocates the thunk as soon as a trigger arrives. (Almost any
token could be used for a trigger. The result-address could be used, for example.) The thunk is
sent to node z + 1, which stores the thunk in the lazy slot (at result-address). The thunk is also
sent to nodes z + 2 through z + n + 3, which store data at various offsets in the thunk. Node
z + 2 stores the result-address in the thunk at offset zero. form-address store-data is similar
to store-data, but it accepts a structure pointer, an offset, and a value as inputs instead of
an address and a value. An address (like the result-address) can be computed from a structure
pointer and an offset. Node z + 3 stores a link to the suicide proc Se in the thunk at offset one. The
right constant input to node z + 3 (the link to Se) will ultimately get patched at load time to
have the proper value. Nodes z + 4 through z + n + 3 store the free variables of e at offsets two
through n + 1 respectively. All store nodes produce termination signals (n + 3 altogether),
which are combined into a single signal by the signal tree. In the TTDA, signal trees are built
with binary nodes, so the signal tree in Figure 3.9 would require (n + 3) - 1 = n + 2 nodes.
Monsoon is likely to support higher fan-in for signalling, so fewer nodes would be required.
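The thunk layout and the node count can be checked with a few lines of arithmetic. The Python sketch below is an illustration only; the function names are hypothetical, and the fan-in formula for wider signal nodes is our assumption (each k-input node is taken to merge k signals into one), since the text gives only the binary case.

```python
def thunk_layout(result_address, suicide_proc, free_vars):
    """Flat thunk: offset 0 = result-address, offset 1 = link to Se, then V1..Vn."""
    return [result_address, suicide_proc, *free_vars]


def signal_tree_nodes(signals, fan_in=2):
    """Nodes needed to merge `signals` termination signals into one.

    Each fan_in-input node reduces the number of outstanding signals by
    fan_in - 1, so ceil((signals - 1) / (fan_in - 1)) nodes are needed.
    """
    return -(-(signals - 1) // (fan_in - 1))


def build_graph_nodes(n, fan_in=2):
    """Total nodes in the thunk-building graph for n free variables."""
    allocate = 1            # node z
    stores = n + 3          # node z + 1, plus nodes z + 2 through z + n + 3
    return allocate + stores + signal_tree_nodes(stores, fan_in)
```

With binary signal nodes, build_graph_nodes(n) agrees with the 2n + 6 figure quoted above; a larger fan_in gives the smaller count anticipated for Monsoon.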

Note that the thunk is stored at the result-address (node z + 1) in parallel with the free
variables being stored (nodes z + 4 through z + n + 3). So if the thunk were spawned (which
could not happen before the link to Se were in place: node z + 3 fires) because someone issued a
fetch at the result-address (which could not happen before the thunk were in place: node z + 1
fires), the suicide proc could begin executing before all the free variables are in place. In fact,
some free variables may not yet be computed.

The  values  of  the  free  variables  of  the  delayed  expression,  the  result-address,  and  a  trigger 
are  the  inputs  to  the  graph  fragment,  and  the  termination  signal  is  the  output.  A  copy  of  this 
dataflow  graph  schema  appears  each  time  a  slot  of  a  data  structure  is  assigned  lazily.  Since  the 
number  of  free  variables  depends  on  the  expression  being  delayed,  the  size  of  the  graph  varies. 

Some  approaches  to  storage  management  insist  that  locatives  (addresses  of  interior  structure 
words)  such  as  the  result-address  not  exist  unless  a  full-fledged  pointer  to  the  object  exists,  and 
some  exclude  locatives  entirely.  Both  of  these  restrictions  can  be  accommodated  by  expanding 
the  thunk  with  an  additional  slot,  and  storing  the  pointer  to  the  L-structure  and  the  offset  of 
the delayed slot instead of just storing the result-address.

3.3.2 Suicide Procs


A  suicide  proc  has  a  single  entry  point  (standard  Id  procedures  have  additional  entry  points 
which receive arguments). The thunk invocation manager sends the thunk to the suicide proc's
entry point, which is located at a fixed offset in the code for the suicide proc.

The  suicide  proc  unpacks  the  free  variables  of  the  delayed  expression  as  well  as  the  result 
address.  The  delayed  expression  can  now  proceed  normally.  When  the  delayed  expression 
produces a result, the result is stored at the result-address, back in the previously delayed slot.
After  the  result  is  stored  and  all  activity  associated  with  the  delayed  expression  has  completed, 
the  suicide  proc  deallocates  its  invocation  frame  and  terminates. 

The  schema  for  suicide  procs  is  given  in  Figure  3.10. 


Figure  3.10:  Schema  for  a  Suicide  Proc 

The dataflow graph in Figure 3.10 is numbered from zero to indicate that a suicide proc is
not embedded in any larger graph. n + 5 nodes are drawn explicitly, and the rest belong to the
delayed expression.

Node zero is the entry point for a suicide proc and must ultimately be stored at a fixed offset
in the assembled code. Node zero, an identity, receives the thunk and distributes it to nodes
one through n + 1. Like the form-address store-data nodes in the dataflow graph that loaded
the thunk with values (Figure 3.9), the form-address fetch nodes retrieve the data stored at
the indicated offset in the thunk. Node one fetches the result-address from thunk offset zero.
Nodes two through n + 1 fetch the values of the free variables V1 through Vn respectively, and
deliver them to the graph for the expression that was delayed. The expression graph is embedded
directly in the suicide proc (i.e., no additional procedure calls are necessary). The graph for
the embedded expression eventually produces a result and (optionally) a termination signal.
The result is sent to node n + 2, where the result-address is also sent. Node n + 2 stores the
result at the result-address, placing a value in the previously delayed slot. Node n + 2 produces
a termination signal which is combined with the signal generated by the embedded expression
(if it generated one) by node n + 3 (a gate, which was drawn as a bow-tie in Section 1.5.1).
A gate node passes on its “top” input when both the “top” and “side” inputs are received.
When node n + 3 produces a token, it is guaranteed to be the only remaining token associated
with the suicide proc invocation. Node n + 4 consumes this token as it deallocates the current
invocation frame.
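The division of labor between the thunk invocation manager (Section 3.2.4) and a suicide proc can be sketched in a few lines. This Python is a sequential illustration under stated assumptions, not the dataflow graph itself: memory is modelled as a dictionary keyed by result-address, invocation frames as items on a free list, and the names make_suicide_proc and invoke_thunk are hypothetical.

```python
def make_suicide_proc(expression):
    """Wrap `expression` (a function of the free variables) as a suicide proc."""

    def suicide_proc(thunk, frame, memory, free_frames):
        result_address = thunk[0]       # node one: fetch offset zero
        free_vars = thunk[2:]           # nodes two..n+1: fetch V1..Vn
        # Node n + 2: store the result back into the previously delayed slot.
        memory[result_address] = expression(*free_vars)
        # Node n + 4: the suicide proc deallocates its own invocation frame.
        free_frames.append(frame)

    return suicide_proc


def invoke_thunk(thunk, memory, free_frames):
    """Thunk invocation manager: fetch Se, acquire a frame, pass the thunk on."""
    suicide_proc = thunk[1]             # link to Se at offset one
    frame = free_frames.pop()           # acquire a new invocation frame
    suicide_proc(thunk, frame, memory, free_frames)
    # No termination manager is involved: nothing is returned to a parent.
```

For example, invoking the thunk ["slot7", make_suicide_proc(lambda a, b: a + b), 3, 4] leaves the sum in memory["slot7"] and returns the frame to the free list, with no value passed back to any caller.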

3.4  Concrete  Costs  and  Benefits 

There  are  two  sets  of  concrete  costs  to  our  approach.  First  are  the  additional  hardware  costs, 
which,  as  we  have  seen,  are  minimal.  Two  additional  states  are  required  for  the  structure 
transition diagram. Next is the cost of our thunk mechanism as measured in machine operations.
We  will  consider  this  cost  in  greater  detail. 

In  traditional  lazy  interpreters,  we  can  attribute  a  cost  to  each  of  the  following: 

1.  creating  a  thunk 

2.  testing  if  a  thunk  has  been  evaluated 

3.  evaluating  a  thunk 

4.  reading the value

Our hardware support allows for substantial savings. Creating a thunk, evaluating a thunk,
and  reading  a  value  are  done  in  the  usual  way.  But  testing  to  see  if  a  thunk  has  been  evaluated 
is  better  than  free;  it  is  implicit.  That  means  that  users  spend  no  time  or  code  space  in  testing 
or  invoking  a  thunk.  The  first  read  of  a  delayed  slot  implicitly  spawns  the  corresponding 
suicide  proc,  and  all  successive  reads  behave  in  the  usual  fashion.  Conventional  systems  require 
repeated  checking  to  see  if  the  thunk  has  been  forced.  Furthermore,  we  can  traverse  streams 
exactly  like  lists,  as  long  as  we  don’t  traverse  the  spine  eagerly  looking  for  the  end. 

The power of our integrated mechanism should not be underestimated. Consider two standard
ways of building streams without hardware support. The delayed cell might be explicit,
in which case the consumer must make two references each time the stream is walked one step.
This technique is pictured abstractly in Figure 3.11, which represents the stream of integers from
one, with the first three cells evaluated. The cons cells and the delayed cells alternate. Even if
the delayed cell is traversed implicitly, it must be traversed. Concert [22] has hardware support
to traverse futures (i.e., delayed cells) implicitly, and the garbage collector collapses evaluated
futures. Without implicit traversal of the delay (including implicit forcing) a consumer cannot
treat this structure as a list.


Figure 3.11: A Stream with Explicit Delay Cells
(delay cells, left to right: previously delayed, previously delayed, currently delayed)

Another possibility is that a flag can be embedded in the stream cell itself. This can be
accomplished by expanding the stream cell to a triple. This technique is pictured abstractly in
Figure 3.12, which represents the stream of integers from one, with the first three cells evaluated.
Unfortunately, we end up with a new structure that is fundamentally different from a list. This
technique does not extend easily to other data structures.

In  both  of  these  alternatives,  primitives  for  synchronization  must  be  provided,  unless  the 
mechanism  is  implicit. 
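The difference between explicit delay cells and an implicit mechanism can be illustrated in software. The Python below is a sketch under stated assumptions: all class and function names are hypothetical, and the implicit forcing that the L-structure memory performs in hardware is imitated here by a property, so the test for "already evaluated" is merely hidden from the consumer rather than eliminated.

```python
class Delay:
    """Explicit delay cell: the consumer must force it."""

    def __init__(self, compute):
        self._compute, self._done, self._value = compute, False, None

    def force(self):
        if not self._done:      # the explicit "has it been evaluated?" test
            self._value, self._done = self._compute(), True
            self._compute = None
        return self._value


class Cons:
    def __init__(self, head, tail):
        self.head, self.tail = head, tail


class LazyCons:
    """Cell whose tail slot forces itself on first read."""

    def __init__(self, head, tail_thunk):
        self.head = head
        self._tail = Delay(tail_thunk)

    @property
    def tail(self):
        return self._tail.force()


def take(n, cell):
    """Ordinary list traversal; it knows nothing about delays."""
    out = []
    while n > 0 and cell is not None:
        out.append(cell.head)
        cell = cell.tail
        n -= 1
    return out


def take_explicit(n, cell):
    """Traversal over explicit delay cells: two references per step."""
    out = []
    while n > 0 and cell is not None:
        out.append(cell.head)
        cell = cell.tail.force()
        n -= 1
    return out


def ints_explicit(k):
    return Cons(k, Delay(lambda: ints_explicit(k + 1)))


def ints_implicit(k):
    return LazyCons(k, lambda: ints_implicit(k + 1))
```

The same take function walks both an ordinary Cons list and the implicitly forced stream, whereas the explicit-delay stream needs its own traversal routine.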


Figure 3.12: A Stream with “Special Stream-Cons” Cells
(hybrid cells, left to right: previously delayed, previously delayed, currently delayed)


3.4.1  In-Thunk  Substitution 


Id provides a facility for the inline substitution of procedures, allowing programs to use
procedural abstraction at no cost, simultaneously trading execution time for program-store space. A
programmer may use the defsubst keyword in place of the def keyword in defining a
procedure. Whether or not a procedure is substituted inline makes no difference to program
behavior; the procedure call version and the inline version have identical semantics. In Id,
procedures are declared inlineable at the point of definition. It is also possible to make this decision
at the point of call. This possibility is particularly interesting when considered in conjunction
with our lazy data-structure mechanism.

A  programmer  declares  a  definition  inlineable  by  using  the  defsubst  keyword.  In  this  way, 
a  single  annotation  corresponds  to  a  potentially  large  number  of  procedure  calls.  Note  that 
the  defsubst  annotation  is  not  a  directive  to  inline  a  procedure,  but  merely  a  permission  or 
a suggestion. Consequently, recursive definitions can be marked defsubst-able without fear of
the compiler’s diverging. A recursive defsubst cannot be substituted repeatedly until steady
state is reached, as an infinite chain of ever larger procedures would be generated. Now let us
consider  the  relationship  of  inlining  to  thunks. 

Consider the process of extending the tail of a recursively defined stream, say the integers,
as defined using ints_from. Each time the stream is extended, there is a thunk evaluation, as
well as a call to ints_from. There is a mutual recursion between ints_from, a user-defined
procedure, and the suicide proc, created by the compiler, but there is only one suspension for
each cycle. The natural breaking point is at the suicide proc, for that is where the computation
is suspended. By substituting ints_from into the suicide proc code block, argument packing and
unpacking as well as a manager call to allocate an invocation frame can be avoided. ints_from
sets up the stream, and from there on the recursion is from suicide proc to suicide proc. Consider
the following definitions of ints and ints_from:

typeof ints_from = N -> (list N);
def ints_from n = n:#ints_from (n+1);

typeof ints = list N;
ints = ints_from 1;

Each call to ints_from builds a thunk. When the thunk is spawned, all it does is call
ints_from and do some bookkeeping. That call to ints_from builds another thunk, with a
pointer to the same suicide proc as had the last thunk. All the calls to ints_from after the
first can be substituted inline.

We  are  inlining  a  procedure  captured  by  a  suicide  proc.  Alternatively,  we  are  shifting  a 
procedural  recursion  broken  by  a  suicide  proc,  into  a  suicide  proc  recursion.  We  can  think  of 
this  as  unrolling  the  top  half  of  the  first  iteration  of  a  loop,  and  grouping  the  bottom  half  of 
each  iteration  with  the  top  half  of  the  next. 

This shifting of the procedure boundary greatly decreases the marginal cost of delaying
expressions. Infinite data-structures must be defined with some sort of recursion, and we
can replace the procedural recursion by suicide proc recursion. Suicide proc recursion is slightly
more expensive than procedure recursion, as it must access memory to pass arguments (building
and unpacking thunks), but it is cheaper than paying for both a suicide proc invocation and a
procedure call on every step.

3.4.2  Unwinding 

The unwinding facility allows the programmer to perform certain kinds of speculative
computation. Suspensions are not free, and unwinding allows the programmer to trade suspensions
for the possibility of extra work. Consider a system that supports persistent objects. When
reconstituting an object (bringing it into the name-space) the entire object need not be
reconstituted at once. If the object were a long list, for example, the system might reconstitute as
much as fit on one page of memory, and delay reconstituting the rest.

3.4.3  Fetch  Elimination 

In Id, if a data structure slot were fetched and discarded, the fetch would be discarded by the
compiler as well, as dead code. Why read a value if it won’t be used? Now, reading a value
can cause a delayed expression to be evaluated. Can we still eliminate lone fetches? After all,
it is not possible for a consumer (the reader) to tell if the slot was delayed, in general. If the
compiler can prove that the program behavior won't change, then fetch elimination is safe.
Fetch elimination is not always safe.

In a functional language, it appears that fetch elimination is safe. If an expression is
discarded, we should not care if it got evaluated since there are no non-local effects. What
about  I/O?  Suppose  an  array  is  defined  with  several  lazy  slots.  A  programmer  may  wish  to 
force  the  evaluation  of  several  positions  before  printing  the  array  or  before  making  the  array  a 
persistent  object.  If  the  fetch  elimination  problem  were  not  present,  each  relevant  object  could 
be  fetched  and  discarded  as  we  do  not  want  the  values,  at  least  for  now. 

And, of course, non-functional constructs may pose a problem. In Id#, any fetch operation
might  now  have  a  side  effect  as  a  result  of  the  presence  of  I-structures  in  the  language.  We 
alluded  to  this  potential  problem  when  presenting  lazy  tuples.  In  the  following  code,  a  store  to 
an  I-structure  is  captured  in  the  delayed  head  of  a  cons  cell: 

typeof a = I_array N;
a = {a = I_array (1,10);
     a[1] = 1;
     a[2] = 2
     in
       a};

typeof cons_cell = list N;
cons_cell = {a[3] = 3 in 10} #: nil;

Whether or not the head of cons_cell is evaluated affects whether or not a[3] is assigned,
a non-local effect, perhaps a multiple definition error.
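The hazard can be reproduced in miniature. The Python sketch below is an illustration with hypothetical names: IArray stands in for a write-once I-structure array, LazyHead for a cons cell with a delayed head, and is_written is a helper added only so that the effect can be observed.

```python
class IArray:
    """Write-once array standing in for an I-structure (sequential simulation)."""

    def __init__(self, lo, hi):
        self._lo = lo
        self._cells = [None] * (hi - lo + 1)
        self._written = [False] * (hi - lo + 1)

    def __setitem__(self, i, value):
        k = i - self._lo
        if self._written[k]:
            raise RuntimeError("multiple definition of element %d" % i)
        self._cells[k], self._written[k] = value, True

    def is_written(self, i):
        return self._written[i - self._lo]


class LazyHead:
    """Cons cell with a delayed head, forced on first read."""

    def __init__(self, head_thunk, tail):
        self._thunk, self._forced, self._head = head_thunk, False, None
        self.tail = tail

    @property
    def head(self):
        if not self._forced:
            self._head, self._forced = self._thunk(), True
        return self._head


def make_cell(a):
    def delayed_head():
        a[3] = 3            # the store captured inside the delayed expression
        return 10

    return LazyHead(delayed_head, None)
```

Reading cell.head and discarding the result still assigns a[3]; if a compiler removed that read as dead code, a[3] would stay unwritten, so the two programs are observably different.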

3.4.4  Sequentialization  Optimization 

When a stream is defined recursively, as it must be to be infinite, the sizes of the thunks corresponding
to successive elements of the stream form a cycle. The integers, as defined in ints_from, require
thunks of uniform size. The following stream, however, requires thunks of two different sizes:

typeof needs_2_thunks = list N;
needs_2_thunks = two_free 1 2;

typeof two_free = N -> N -> (list N);
def two_free x y = x :# one_free (x+y);     % thunk has 2 free vars

typeof one_free = N -> (list N);
def one_free y = y :# two_free y (y+1);     % thunk has 1 free var

In many cases we can determine this structure statically. Consider the integers, once again.
Each  thunk  contains  a  pointer  to  code,  a  pointer  to  an  L-structure  slot,  and  the  value  of  an 
index.  If  we  can  reuse  the  thunk,  we  can  avoid  allocating  and  deallocating  the  thunk  itself,  and 
we  can  avoid  rewriting  unchanging  values.  This  can  be  done  with  some  sequentialization,  and 
very  little  overhead. 

Just  as  a  signal  tree  is  used  in  constructing  a  thunk  (Figure  3.9)  to  detect  when  the  store 
operations  have  completed,  additional  nodes  can  usually  be  added  to  detect  the  completion  of 
various  events. 

Reusing  thunk  shells  is  easier  than  the  general  problem  of  rewriting  locations  in  structures 
since  we  can  guarantee  that  only  one  code  block  points  to  the  thunk  at  a  time.  This  technique 
would  be  restricted  to  strict  expressions,  since  we  must  be  sure  that  the  last  computation  was 
finished  with  the  thunk.  Interestingly  enough,  this  reuse  of  thunks  does  not  interfere  with 
unwinding.  A  stream  can  only  be  suspended  in  one  place,  so  we  never  need  more  than  one 
thunk  at  a  time. 

Another  possibility  is  for  a  stream  to  retain  an  invocation  frame  forever.  The  feasibility  of 
this  depends  on  the  scarcity  of  invocation  frames  as  well  as  other  engineering  issues. 

3.4.5  Explicit  Deallocation 

As  we  have  noted,  by  the  time  a  suicide  proc  has  read  all  the  thunk  slots,  it  is  the  only  one 
with  a  pointer  to  the  thunk.  Hence,  a  suicide  proc  can  always  deallocate  its  thunk  explicitly, 
saving  on  garbage  collection  costs. 

3.4.6  Unstructured  Thunks 

Our  thunks  are  flat  data-structures.  We  can  take  advantage  of  the  particular  lexical  conditions 
where  the  thunk  is  defined.  Sometimes  the  environments,  or  parts  of  the  environments  can  be 
shared. Furthermore, since each thunk is used exactly once, a tailor-made calling convention
can  be  used.  Similar  ideas  were  used  in  the  Rabbit  [42]  and  Orbit  [29]  compilers  for  specializing 
procedure calls.

3.5 Summary

This chapter presents the implementation of Id#. L-structures are developed as the
implementation mechanism for all lazy data-structures. Then suicide procs (the code embedding delayed
expressions) and thunks (the data structures that collect a pointer to a suicide proc and its
environment) are presented along with their dataflow graphs. The thunk invocation manager,
part of the run-time system, is described.

Several concrete issues relating to efficiency are also discussed.

Chapter 4

Methodology for Programming
with Lazy Data-Structures


Now  that  we  have  a  mechanism  for  lazy  data-structures,  we  will  exercise  it.  Because  we  only 
allow  lazy  expressions  in  certain  contexts  (the  arguments  to  data  constructors)  our  model  is 
not  powerful  enough  to  express  all  the  programs  expressible  in  a  lazy  functional  language.  In 
this  chapter  we  will  demonstrate  the  expressive  power  and  limitations  of  programming  with 
lazy  data-structures. 

This  chapter  is  divided  into  two  sections,  what  we  can  and  can’t  do  with  lazy  data-structures. 
We  will  consider  many  of  the  classic  programs  used  by  the  functional  languages  community  to 
demonstrate  the  power  of  lazy  functional  languages. 

4.1  What  We  Can  Do  with  Lazy  Data-Structures 

In  this  section  we  show  how  to  code  many  of  the  classic  examples  using  lazy  data-structures. 
There are two reasons we have a chance of succeeding in this task, even though our language
has  less  expressive  power  than  a  lazy  functional  language.  First,  although  non-strictness  as 
embodied  in  Id  is  weaker  than  laziness,  quite  often  it  will  suffice.  And  second,  laziness  is 
usually  associated  with  data  structures. 

We consider streams, other lazy data-structures, and search programs.

4.1.1 Streams


Streams  are  the  most  famous  use  of  laziness*.  Streams  are  potentially  infinite  lists  that  expand 
on  demand,  and  provide  a  uniform  interface  for  controlling  uniform  data  of  unpredictable  size. 
In  a  lazy  language,  all  lists  behave  like  streams.  In  our  language,  however,  special  provisions 
must be taken to define infinite objects. Consider the stream of positive integers:

typeof ints_from = N -> (list N);
def ints_from n = n:#ints_from (n+1);

typeof ints = list N;
ints = ints_from 1;

%= 1 :# 2 :# 3 :# ...

Of course, we could have used the special syntax for arithmetic sequences. The integers
might be consumed as follows. The definition of add_first_n uses pattern matching to destructure
arguments. The two dots indicate that the second clause should only be considered if
the pattern in the first clause fails to match. Normally, all clauses can be considered in parallel.
Only one such clause is allowed, and it is only considered if the patterns in all other clauses fail
to match.

typeof add_first_n = N -> (list N) -> N;
def add_first_n 0 s = 0
 |.. add_first_n n (x:xs) = x + add_first_n (n-1) xs;

typeof triangle = N -> N;
def triangle n = add_first_n n ints;
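For readers without access to an Id system, the same producer and consumer can be approximated with Python generators. This is only an analogy under stated assumptions: a generator stands in for the lazily extended stream, and itertools.islice plays the role of the pattern-matching recursion that consumes the first n elements.

```python
from itertools import islice


def ints_from(n):
    """Unbounded stream n, n+1, n+2, ...; each element is produced on demand."""
    while True:
        yield n
        n += 1


def add_first_n(n, stream):
    """Sum the first n elements; works on any iterable, lazy or not."""
    return sum(islice(stream, n))


def triangle(n):
    """n-th triangular number, computed from the stream of positive integers."""
    return add_first_n(n, ints_from(1))
```

As in the Id version, the consumer is written once and accepts an ordinary list just as well as the infinite stream.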

As compared to the equivalent program written in a lazy functional language, ints_from, the
producer, requires a single annotation, and add_first_n, the consumer, needs no annotation.
Functions that consume streams are identical to the functions that consume lists. The following
functions are standard stream producers, transformers, and consumers. Where appropriate, a
stream comprehension version is also given (indicated by a trailing underscore).

map  a  uneury  function  over  a  stream 
typeof  smapl  *  (*0  ->  *1)  ->  (list  *0)  ->  (list  *1); 
def  smapl  f  nil  =  nil 
I  smapl  f  (x:xs)  »  f  x  :f  smapl  f  xs; 

‘We  will  not  engage  in  a  discussion  of  the  utility  of  stream-style  programming,  but  we  will  use  it  as  a  vehicle 
for  discussing  our  system. 
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def  smapl.  f  s  =  {:#  f  x  ||  x  <-  8>; 

%-  f  si  :i  f  s2  :t  s3  :i  ... 

%%% the squares of the integers
typeof squares = list N;
squares = smap1 {fun x = x*x} (upfrom 1);
squares_ = {:# x*x || x <- upfrom 1};
%= 1 :# 4 :# 9 :# ...

%%% scale a stream by a constant factor
typeof scale = N -> (list N) -> (list N);
def scale a s = smap1 ((*) a) s;
def scale_ a s = {:# a*x || x <- s};
%= a*s1 :# a*s2 :# a*s3 :# ...

%%% the integers based on recursive stream mapping
typeof ints1 = list N;
ints1 = {def suc x = x+1;
         ints = 1 :# smap1 suc ints
         in ints};
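The mapping functions above have a direct analogue in Python generators, which we sketch here under the assumption that a generator is an acceptable stand-in for a stream. One difference is worth noting: the self-referential generator `ints1` below re-creates its own prefix on each recursive call, whereas the Id# stream is a shared data structure whose elements are computed once.

```python
from itertools import count, islice

def smap1(f, s):
    """Map a unary function over a stream, one element per demand."""
    for x in s:
        yield f(x)

def squares():
    return smap1(lambda x: x * x, count(1))

def scale(a, s):
    return smap1(lambda x: a * x, s)

def ints1():
    """1 :# smap1 suc ints, as a self-referential generator."""
    yield 1
    yield from smap1(lambda x: x + 1, ints1())

print(list(islice(squares(), 3)))  # first three squares
```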


There is no comprehension for the recurrence ints1. Consider the following function for
numerical integration:


%%% integrate a function using the rectangle rule
typeof integrate = (N -> N) -> N -> N -> N -> N;
def integrate f lo hi dx =
  { xs = smap1 ((+) lo) (scale dx (upfrom 0));
    ys = smap1 f xs;
    areas = smap1 ((*) dx) ys;
    n = fix ((hi-lo)/dx);
  in
    add_first_n n areas};

typeof integrate_ = (N -> N) -> N -> N -> N -> N;
def integrate_ f lo hi dx =
  { xs = {:# lo+(dx*x) || x <- upfrom 0};
    ys = {:# f x || x <- xs};
    areas = {:# dx*y || y <- ys};
    n = fix ((hi-lo)/dx);
  in
    add_first_n n areas};
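The same pipeline of intermediate streams can be sketched in Python with generator expressions. We assume here that fix truncates to an integer, which `int` also does; the function and variable names simply mirror the Id# program.

```python
from itertools import count, islice

def integrate(f, lo, hi, dx):
    """Rectangle rule: sum the first n rectangle areas of an infinite stream."""
    xs = (lo + dx * k for k in count(0))   # sample points lo, lo+dx, ...
    ys = (f(x) for x in xs)                # function values
    areas = (dx * y for y in ys)           # rectangle areas
    n = int((hi - lo) / dx)                # plays the role of fix
    return sum(islice(areas, n))

print(integrate(lambda x: 2 * x, 0.0, 1.0, 0.25))
```

None of the three intermediate streams is ever materialized; each element flows through the pipeline only when the final sum demands it.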


It  is  interesting  to  remove  the  intermediate  streams  from  these  definitions. 


typeof integrate_1_liner = (N -> N) -> N -> N -> N -> N;
def integrate_1_liner f lo hi dx =
  { areas = smap1 ((*) dx)
              (smap1 f
                (smap1 ((+) lo)
                  (scale dx (upfrom 0))));
    n = fix ((hi-lo)/dx);
  in
    add_first_n n areas};

typeof integrate_1_liner_ = (N -> N) -> N -> N -> N -> N;
def integrate_1_liner_ f lo hi dx =
  { areas = {:# dx * f(lo+(dx*x)) || x <- upfrom 0};
    n = fix ((hi-lo)/dx);
  in
    add_first_n n areas};

The mapping version gets cumbersome quickly, but the comprehension version remains
compact. We still need to reduce the stream using a separate mechanism. We would like the
compiler to have the ability to remove intermediate lists and streams and compile integrate_
as efficiently as integrate_1_liner_, but we will not deal with that topic in this thesis. We
continue developing our stream library.

%%% map a binary function over two streams
typeof smap2 = (*0 -> *1 -> *2) -> (list *0) -> (list *1) -> (list *2);
def smap2 f (x:xs) (y:ys) = f x y :# smap2 f xs ys
 |..smap2 f x y = nil;

def smap2_ f xs ys = {:# f x y || (x,y) <- lazy_zip2 xs ys};

zip2,  which  is  part  of  the  standard  environment,  turns  a  pair  of  lists  into  a  list  of  pairs. 
We  must  define  library  functions  to  do  lazy  zipping. 

%%% zip two streams lazily
def lazy_zip2 (x:xs) (y:ys) = (x,y) :# lazy_zip2 xs ys
 |..lazy_zip2 x y = nil;

If we used zip2 instead of lazy_zip2 in smap2_, the result would still be generated
lazily, but the intermediate list of (x,y) pairs would be generated eagerly.

%%% first order recurrence given a seed and a recurrence relation
typeof sgen1 = (*0 -> *0) -> *0 -> (list *0);
def sgen1 f x0 = x0 :# sgen1 f (f x0);

%%% the integers as a first order recurrence
typeof ints2 = list N;
ints2 = sgen1 ((+) 1) 1;

%%% second order recurrence given two seeds and a recurrence relation
typeof sgen2 = (*0 -> *0 -> *0) -> *0 -> *0 -> (list *0);
def sgen2 f x0 x1 = x0 :# sgen2 f x1 (f x0 x1);

%%% the Fibonacci numbers as a second order recurrence
typeof fibs = N -> N -> (list N);
def fibs a b = sgen2 (+) a b;
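The two recurrence generators translate almost literally into Python generators. The sketch below keeps the names sgen1, sgen2, and fibs; everything else (the use of a loop instead of recursion) is our own choice for the illustration.

```python
from itertools import islice

def sgen1(f, x0):
    """Stream x0, f(x0), f(f(x0)), ..."""
    while True:
        yield x0
        x0 = f(x0)

def sgen2(f, x0, x1):
    """Stream x0, x1, f(x0, x1), f(x1, f(x0, x1)), ..."""
    while True:
        yield x0
        x0, x1 = x1, f(x0, x1)

def fibs(a, b):
    return sgen2(lambda x, y: x + y, a, b)

print(list(islice(fibs(1, 1), 8)))
```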

%%% filter a stream by a predicate
typeof sfilter = (*0 -> B) -> (list *0) -> (list *0);
def sfilter p nil = nil
 |  sfilter p (x:xs) = if p x then x :# sfilter p xs
                       else sfilter p xs;

def sfilter_ p s = {:# x || x <- s when p x};


%%% prefix computation (partial products) on a stream
typeof sexpand = (*0 -> *0 -> *0) -> *0 -> (list *0) -> (list *0);
def sexpand f z nil = nil
 |  sexpand f z (x:xs) = { val = f z x
                         in
                           val :# sexpand f val xs};

%%% stream of all triangle numbers
typeof triangles = list N;
def triangles = sexpand (+) 0 ints;

%%% incremental dot product of two sequences
typeof dot = (list N) -> (list N) -> (list N);
def dot s1 s2 = sexpand (+) 0 (smap2 (*) s1 s2);
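The prefix computation is a running fold that emits every partial result except the seed. A Python sketch, again using generators as a stand-in for streams and assuming ints starts at 1:

```python
from itertools import count, islice
from operator import add, mul

def sexpand(f, z, s):
    """Yield f(z, s1), f(f(z, s1), s2), ... (the seed z itself is not emitted)."""
    val = z
    for x in s:
        val = f(val, x)
        yield val

def triangles():
    return sexpand(add, 0, count(1))

def dot(s1, s2):
    """Running dot product of two sequences, one partial sum per element pair."""
    return sexpand(add, 0, map(mul, s1, s2))

print(list(islice(triangles(), 5)))
```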

Streams have eager heads and lazy tails, and we can implement this behavior precisely
in our model. It is difficult and often impossible for a compiler to make such optimizations
automatically (in this case to recognize that the heads can be evaluated eagerly). This problem
is handled explicitly in our system, i.e., not automatically, but under programmer control.

It  is  interesting  to  note  which  programs  have  a  comprehension  syntax.  Comprehension 
handles  simple  generation  and  transforming,  but  not  recurrence,  compression,  or  expansion 
(scanning). 

A problem elegantly solved by streams is Hamming's problem [27], generating the ordered
list of integers $\mathcal{H}$ containing exclusively 2's, 3's, and 5's as factors.
The following recursive fact leads to the clever solution below:

$$\mathcal{H} = \{1\} \cup 2\mathcal{H} \cup 3\mathcal{H} \cup 5\mathcal{H}$$

typeof merge = (list N) -> (list N) -> (list N);
def merge (x:xs) (y:ys) =        % merge two ordered streams; no dups.
  if x<y then x :# merge xs (y:ys)
  else if x>y then y :# merge (x:xs) ys
  else x :# merge xs ys;

typeof hamming = list N;
hamming =
  { h235 = 1 : merge (scale 2 h235) h35 ;
    h35  = 3 : merge (scale 3 h35 ) h5  ;
    h5   = 5 : scale 5 h5
  in
    h235};

This is our first example where conses with various degrees of laziness have been mixed and
matched. merge depends on laziness to break an infinite recursion, but hamming can safely call
the normal cons function, a more efficient call. If we replaced the ":" symbols in hamming with
":#" symbols, the program would behave identically, although slightly less efficiently.

If we switched the stream-conses (":#") in merge with the list-conses (":") in hamming, the
program would diverge. How did we know when to use stream-cons and when to use list-cons?
The conses in hamming are only executed once, while the conses in merge break a recursion.
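To make the role of the lazy tail concrete, here is a sketch in Python of a cons cell with an eager head and a delayed, memoized tail, followed by the Hamming program built on it. The `Stream` class and `take` helper are hypothetical names introduced for this illustration. Because Python is strict rather than non-strict, all three conses in `hamming` must delay their tails here; in Id# the ordinary ":" suffices for them, as discussed above.

```python
class Stream:
    """Cons cell with an eager head and a lazy tail that is forced at most once."""
    def __init__(self, head, tail_thunk):
        self.head = head
        self._thunk = tail_thunk
        self._tail = None
        self._forced = False

    @property
    def tail(self):
        if not self._forced:
            self._tail = self._thunk()
            self._thunk = None      # drop the thunk once it has been evaluated
            self._forced = True
        return self._tail

def scale(a, s):
    return Stream(a * s.head, lambda: scale(a, s.tail))

def merge(x, y):
    """Merge two ordered infinite streams, dropping duplicates."""
    if x.head < y.head:
        return Stream(x.head, lambda: merge(x.tail, y))
    if x.head > y.head:
        return Stream(y.head, lambda: merge(x, y.tail))
    return Stream(x.head, lambda: merge(x.tail, y.tail))

def hamming():
    h5 = Stream(5, lambda: scale(5, h5))
    h35 = Stream(3, lambda: merge(scale(3, h35), h5))
    h235 = Stream(1, lambda: merge(scale(2, h235), h35))
    return h235

def take(n, s):
    out = []
    for _ in range(n):
        out.append(s.head)
        s = s.tail
    return out

print(take(10, hamming()))
```

The memoization in `tail` matters: without it, each demand would rebuild the shared prefixes of h235, h35, and h5 from scratch.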

No  collection  of  stream  programs  would  be  complete  without  a  sieve  of  Eratosthenes  for 
generating  primes. 

%%% THE SIEVE OF ERATOSTHENES
typeof remove_multiples = N -> (list N) -> (list N);
def remove_multiples e s = sfilter {fun x = 0<>(rem x e)} s;

typeof sieve = (list N) -> (list N);
def sieve (x:xs) = x :# sieve (remove_multiples x xs);

typeof primes = list N;
primes = sieve (upfrom 2);


Or  we  can  do  it  another  way. 


%%% THE PRIMES by FILTERING THE INTEGERS
%%% Use the prime stream being generated to filter [3 5 7 ...]
typeof prime_test_given = (list N) -> N -> B;
def prime_test_given (p:ps) x =
  if (p*p) > x then true               % tested to root x; prime
  else if 0 == (rem x p) then false    % a composite
  else prime_test_given ps x;

typeof primes1 = list N;
primes1 =
  {primes = 2 :# sfilter (prime_test_given primes) (upfrom 3 by 2)
   in primes};
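The first of the two prime programs, the sieve, can be sketched with a recursive Python generator. This is an illustration only; the generator expression plays the role of remove_multiples, and each prime that is produced wraps one more filter around the remaining integers.

```python
from itertools import count, islice

def sieve(s):
    """Yield the head of s, then sieve the rest with its multiples removed."""
    x = next(s)
    yield x
    yield from sieve(n for n in s if n % x != 0)

def primes():
    return sieve(count(2))

print(list(islice(primes(), 10)))
```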


The  prime  test  is  strictly  a  consumer,  ignorant  of  the  fact  that  delaying  may  be  going  on. 
The  programs  for  generating  the  prime  streams  each  require  a  single  #. 

A lazy system does not distinguish between lists and streams. We, on the other hand,
consume streams and lists similarly, but must produce them differently. Because we require
annotations, separate libraries (containing programs such as map) must be used for lists and
streams.

4.1.2  Other  Lazy  Data-Structures 

In  this  section  we  consider  two  other  data  structures  that  can  be  constructed  lazily,  trees  and 
memoization  tables. 

Trees 

We define a tree as a leaf or a node with a value and two sub-trees. We also define a function
which lazily maps a unary operator over the elements of a tree, a function to add up the
numbers in the top few levels of a tree of numbers, and a sample infinite tree of numbers.

type tree *0 = leaf | node *0 (tree *0) (tree *0);

typeof map_tree = (*0 -> *0) -> (tree *0) -> (tree *0);
def map_tree f leaf = leaf
 |  map_tree f (node x t1 t2) =
      node (f x) (#map_tree f t1) (#map_tree f t2);

typeof add_n_levels = N -> (tree N) -> N;
def add_n_levels n leaf = 0
 |  add_n_levels 0 (node x t1 t2) = 0
 |..add_n_levels n (node x t1 t2) =
      x + add_n_levels (n-1) t1 + add_n_levels (n-1) t2;

typeof funny_tree = tree N;
funny_tree =
  {t = node 1 (map_tree ((+) 1) t) (map_tree {fun x = x*x} t)
   in
     t};

The tree constructor map_tree requires a single annotation to delay the expansion of each
lazy child. But add_n_levels, a tree traversing procedure, requires none.
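A Python sketch of the same idea follows, with each child held as a thunk that is called on demand. The `Node` class is a hypothetical stand-in for the node constructor with delayed children, and `None` plays the role of leaf. For brevity the thunks are not memoized, so a child is rebuilt each time it is visited; the lazy slots of Id# would compute it once.

```python
class Node:
    """Tree node with an eager value and two delayed children."""
    def __init__(self, value, left_thunk, right_thunk):
        self.value = value
        self._left = left_thunk
        self._right = right_thunk

    def left(self):
        return self._left()

    def right(self):
        return self._right()

def map_tree(f, t):
    if t is None:                      # leaf
        return None
    return Node(f(t.value),
                lambda: map_tree(f, t.left()),
                lambda: map_tree(f, t.right()))

def add_n_levels(n, t):
    if t is None or n == 0:
        return 0
    return (t.value
            + add_n_levels(n - 1, t.left())
            + add_n_levels(n - 1, t.right()))

def funny_tree():
    # t refers to itself inside the thunks; they run only after t is bound.
    t = Node(1,
             lambda: map_tree(lambda x: x + 1, t),
             lambda: map_tree(lambda x: x * x, t))
    return t

print(add_n_levels(3, funny_tree()))
```

Only the nodes in the requested levels are ever constructed, even though the tree is infinite.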

This  scenario  covers  a  wide  variety  of  “AI  search  problems”  such  as  mini-max  and  game 
searching,  where  the  item  stored  at  the  node  describes  a  situation  reachable  from  the  parent  in 
one  step.  While  the  search  tree  may  be  finite,  it  may  be  enormous  (for  example,  the  number 
of  reachable  chess  positions),  but  only  a  small  portion  of  it  need  be  expanded  at  any  time. 
Searching  large  spaces  fits  into  the  “infinite  structures”  programming  paradigm. 


Memoized  Functions 

Suppose  we  have  a  function  over  a  small  discrete  domain  that  is  expensive  to  compute,  and 
suppose  that  we  are  not  likely  to  need  the  value  of  the  function  for  all  values  of  the  domain,  but 
we are likely to need some values several times. In such a situation it may be worthwhile to
perform  a  small  amount  of  work  for  each  element  of  the  domain  in  order  to  save  computing  a 
few  values  in  the  range.  The  following  routine  memoizes  any  function  over  the  specified  integer 
range.  Notice  that  the  result  is  a  function. 

typeof memoize = (N -> *0) -> (N,N) -> (N -> *0);
def memoize f (lo,hi) =
  { memo_array = {array (lo,hi)
                  | [i] # f i || i <- lo to hi};
    def memo_f i = memo_array[i];
  in
    memo_f};
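In a language with assignment, the bookkeeping that a lazy slot performs implicitly has to be written out by hand. The Python sketch below does exactly that: each slot starts out holding a sentinel and is overwritten the first time it is read. The sentinel `_unset` and the list `slots` are our own names for this illustration.

```python
def memoize(f, lo, hi):
    """Return a function equal to f on lo..hi that computes each value at most once."""
    _unset = object()                    # marks a slot that has not been evaluated
    slots = [_unset] * (hi - lo + 1)

    def memo_f(i):
        if slots[i - lo] is _unset:      # first read of this slot: evaluate and store
            slots[i - lo] = f(i)
        return slots[i - lo]

    return memo_f

calls = []
def slow_square(i):
    calls.append(i)
    return i * i

fast_square = memoize(slow_square, 0, 9)
print(fast_square(7), fast_square(7), calls)
```

The update of `slots` is precisely the side effect that is unavailable to the programmer in a functional language and that the lazy array provides behind the scenes.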

This example, clearly in the "expensive slots" programming paradigm, allows the programmer
to trade some implicit bookkeeping for potential efficiency. "Why", one might ask, "doesn't
the programmer simply manage the bookkeeping explicitly? Just keep a table of flags indicating
whether or not the slot has been evaluated, and check that first before computing." No
problem, so far. But what happens when you need to change the value of a flag? That facility
is not available in a functional language. The technique demonstrated in this section allows the
programmer to associate the needed information with each slot, but no more.

4.1.3  Search 

We  have  already  mentioned  a  class  of  search  problems  which  can  be  viewed  as  explicit  tree 
walks.  Now  we  consider  a  couple  of  specific  search  problems. 

Eight  Queens 

The  eight  queens  problem  [47]  can  be  solved  as  a  search  problem.  The  problem  is  to  place 
eight  queens  on  a  chess  board  so  that  no  queen  can  capture  another  queen  in  a  single  move. 
Algorithms  to  generate  the  ninety-two  solutions  are  well  known.  The  idea  is  to  build  solutions 
one  column  at  a  time.  A  partial  solution  is  an  eight  row  by  i  column  board  (i  <  8).  Start 
with  an  empty  zero  column  solution,  and  extend  all  solutions  with  i  —  1  columns  to  i  columns. 
Partial  solutions  with  i  columns  are  represented  as  partial  permutations  (lists  of  length  i)  of 
the  integers  from  zero  to  seven.  Each  integer  indicates  the  corresponding  queen’s  row. 

To generate all solutions, the program below uses non-strict data-structures. Laziness is not
necessary. L!i selects the ith element from a list L (zero origin is used: L!0 is the first element
of list L).

The  checks  procedure  determines  if  adding  a  queen  in  row  q  will  endanger  the  queen  already 
in  column  i  of  partial-board  board. 

typeof checks = N -> (list N) -> N -> B;
def checks q board i =
  { board_i = board!i
  in
    q == board_i or abs (q-board_i) == i+1};

The safe procedure determines if it is safe to add a queen in row q to partial-board board.
safe calls checks on each column of board.

typeof safe = N -> (list N) -> B;
def safe q board = foldr_list (and) true
                     {: not (checks q board i)
                      || i <- 0 to (length board)-1};
The queens procedure sets up the partial solution table boards, with a single empty solution
with zero columns. All solutions with i-1 columns are extended to i columns for all safe
extensions.

typeof queens = N -> (list (list N));
def queens n =
  { boards = {array (0,n)
              | [0] = nil:nil
              | [i] = extend i || i <- 1 to n};
    def extend i = {: q:board || board <- boards[i-1]
                               & q <- 0 to 7
                               when safe q board}
  in
    boards[n]};

The  reason  we  do  not  need  laziness  is  that  the  search  space  is  explored  in  its  entirety. 
Generating  the  first  ten  solutions,  however,  is  not  as  simple,  since  we  do  not  wish  to  exhaust 
the  search  space.  Nonetheless,  we  can  find  the  solutions  by  replacing  the  list  enumeration  in 
queens  by  stream  enumeration.  Then,  only  the  desired  number  of  solutions  will  be  produced. 

typeof lazy_queens = N -> (list (list N));
def lazy_queens n =
  { boards = {array (0,n)
              | [0] = nil:nil
              | [i] = extend i || i <- 1 to n};
    def extend i = {:# q:board || board <- boards[i-1]
                                & q <- 0 to 7
                                when safe q board}
  in
    boards[n]};
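The effect of switching to stream enumeration can be sketched with a recursive Python generator, where solutions with i columns are produced only as a consumer asks for them. The sketch uses recursion in place of the boards array and represents a partial board as a tuple with the most recently placed queen first, mirroring q:board.

```python
from itertools import islice

def checks(q, board, i):
    """True if a new queen in row q attacks the queen i+1 columns back."""
    return q == board[i] or abs(q - board[i]) == i + 1

def safe(q, board):
    return all(not checks(q, board, i) for i in range(len(board)))

def lazy_queens(n):
    """Lazily enumerate safe placements of n queens on an 8-row board."""
    if n == 0:
        yield ()                         # the single empty partial solution
        return
    for board in lazy_queens(n - 1):
        for q in range(8):
            if safe(q, board):
                yield (q,) + board

first_ten = list(islice(lazy_queens(8), 10))
print(first_ten[0])
```

Taking ten solutions explores only as much of the search space as is needed to find them; draining the generator yields all solutions.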


The  Paraffins 

Turner  popularized  the  paraffins  problem  to  demonstrate  the  power  of  lazy  evaluation  and 
higher-order  programming.  Since  we  will  refer  to  this  problem  several  times  in  the  remainder 
of  the  thesis,  this  section  is  devoted  to  defining  the  problem. 

A paraffin is a molecule with structural formula $C_nH_{2n+2}$. Paraffins, also known as the
alkanes, are acyclic and have no "double bonds". Methane ($CH_4$) and iso-butane ($C_4H_{10}$) are
drawn in Figure 4.1.

A paraffin is isomorphic to an acyclic undirected graph, with the carbon atoms mapping to
internal nodes, and the hydrogen atoms mapping to leaf nodes. All internal nodes have degree
four, and all leaf nodes have degree one. The graphs corresponding to methane and iso-butane
would look exactly like the molecules in Figure 4.1.

Figure 4.1: Two Paraffins (methane and iso-butane)

Without losing any information, we can ignore the hydrogen atoms and the corresponding
leaf nodes, and map a paraffin to an acyclic undirected graph of bounded degree four. Hydrogen
atoms (leaf nodes) are implied to bring the valence (degree) of each carbon atom (internal
node) to four. This common simplification makes the pictures less cluttered and is sometimes
convenient when discussing some aspects of the problem.

Figure 4.2: Two Simplified Paraffins (methane and iso-butane)

If a graph isomorphism exists between the graphs corresponding to two molecules, the
molecules are said to be equivalent. If no graph isomorphism exists between the graphs
corresponding to two molecules with the same number of carbon atoms, the molecules are said to
be isomers (structurally different). In Figure 4.3, all three molecules have the same structural
formula ($C_6H_{14}$). However, (a) and (b) are equivalent, and (c) is an isomer of (a) and (b).

A sub-problem of the paraffins problem involves paraffin radicals (radicals, for short),
sub-molecules of paraffins. A radical is a molecule with structural formula $C_nH_{2n+1}$, a
paraffin with one hydrogen atom missing. Alternatively, any bond in a paraffin can be broken
to produce two radicals.

Figure 4.3: Some Isomers of $C_6H_{14}$ (panels (a), (b), and (c))

A radical is isomorphic to an unordered ternary tree. The hydrogen atoms correspond to
the leaves of the tree. Both the root and the internal nodes have three children, and children
are unordered. Two radicals are equivalent if their corresponding trees have an isomorphism.
Methyl ($CH_3$) and ethyl ($C_2H_5$) are drawn in Figure 4.4.

Figure 4.4: Two Radicals (methyl and ethyl)

We  will  refer  to  the  size  of  a  radical  or  paraffin  as  the  number  of  carbon  atoms  contained 
in  it. 

The  problem  is  to  enumerate,  without  repetition,  the  paraffins  up  to  a  certain  size.  The 
answer  should  list  the  paraffins  of  size  one,  the  paraffins  of  size  two,  and  so  on,  up  to  paraffins 
of  some  specified  size.  “Without  repetition”  is  the  tricky  part.  Just  as  it  is  tricky  to  twist  two 
paraffins  around  to  see  if  they  line  up  (as  we  saw  in  Figure  4.3),  a  program  must  not  output 
two  equivalent  paraffins. 

This defines the abstract problem. Before we get into solutions, however, we give an
indication of the complexity. To do this, we must get into the issue of representation. Radicals
can be represented by ternary trees. radical is defined as a new algebraic type as follows.
Although a hydrogen is not normally considered a radical (a radical of size zero?: $C_0H_1$), it
is convenient for us to do so.

type radical = H | Rad radical radical radical;

Representing paraffins, however, is not as easy. Radicals have a root from which to start,
but paraffins have no such distinguished nodes. We can simply pick a node, and then hang four
radicals from that node as follows:

type  paraffin  =  Para  radical  radical  radical  radical; 

But  if  we  pick  a  node  randomly,  there  are  many  representations  for  a  given  paraffin.  We 
have  even  more  choices  than  that,  as  the  children  are  unordered.  So,  we  see  that  even  our 
representation  for  radicals  is  not  unique,  as  the  sub-radicals  are  unordered,  too.  The  molecule 
pictured  in  (a)  and  (b)  of  Figure  4.3  has  108  representations!  Seventy-two  representations 
correspond  to  choosing  an  “outer  carbon”  as  the  starting  point  (or  paraffin-origin),  and  thirty- 
six  representations  correspond  to  choosing  an  “inner  carbon”  as  the  paraffin-origin.  The  number 
of  representations  of  paraffins  and  radicals  is  exponential  in  the  size. 

Now,  we  have  a  handle  on  the  problem  definition  and  complexity.  We  go  on  to  consider  two 
solutions. 


Turner’s  Paraffins  Program 

Turner's algorithm generates radicals recursively by complete induction². A radical of size n
has three smaller sub-radicals whose sizes add up to n - 1. By making the restriction that the
size of the first sub-radical is less than or equal to the size of the second sub-radical, whose
size is less than or equal to the size of the third sub-radical, many of the duplicate
representations can be avoided for radicals³. This restriction also means that the size of the
first sub-radical can not be more than a third the total sizes of the sub-radicals, and a
corresponding fact for the second sub-radical. Although this implied restriction does not change
the number of representations, it makes the generation more efficient. An array of radicals
ranging in size from zero to n is generated as follows:

typeof rad_array = N -> (array (list radical));
def rad_array n =
  { rads = { array (0,n)
             | [0] = H:nil
             | [i] = radgen i || i <- 1 to n};
    def radgen n = {: Rad a b c ||
                      i <- 0 to floor((n-1)/3)
                    & j <- i to floor((n-1-i)/2)
                    & a <- rads[i]
                    & b <- rads[j]
                    & c <- rads[n-1-i-j]}
  in
    rads};

²By complete induction we mean that the objects of size n depend on objects of sizes less than n.

³The first duplicate occurs at size seven; the proof is left as an exercise to the reader.

The rad_array procedure has two bindings. rads, a memoization array for lists of radicals
of each size, is generated with an array comprehension. radgen, which depends on the existence
of radicals of sizes zero to i - 1 to generate the radicals of size i, generates radicals using a
list comprehension that has generators five levels deep. a, b, and c are the sub-radicals; i, j,
and n-1-i-j are the corresponding sizes of the sub-radicals. i ranges over all possible sizes for
the first sub-radical, and a ranges over all radicals of size i. Similarly for j and b. c ranges
over all radicals of whatever size remains. It should be clear that all radicals are generated,
although not uniquely.
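The size-ordered generation can be checked with a short Python sketch. It follows the structure of rad_array directly; the tuple encoding ("Rad", a, b, c) and the string "H" are our own choices for the illustration. Counting the generated representations shows the sequence 1, 1, 1, 2, 4, 8, 17 for sizes zero through six and 40 at size seven, one more than the 39 distinct radicals of that size, which is consistent with the remark that the first duplicate occurs at size seven.

```python
H = "H"

def rad_array(n):
    """rads[k] lists the radical representations of size k, for k = 0..n."""
    rads = [[H]]
    for size in range(1, n + 1):
        m = size - 1                               # carbons left for the sub-radicals
        level = [("Rad", a, b, c)
                 for i in range(0, m // 3 + 1)     # size of first sub-radical
                 for j in range(i, (m - i) // 2 + 1)   # size of second sub-radical
                 for a in rads[i]
                 for b in rads[j]
                 for c in rads[m - i - j]]
        rads.append(level)
    return rads

print([len(level) for level in rad_array(7)])
```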

That’s  the  easy  part  of  the  algorithm. 

Our representations are not unique. Non-unique representations make it difficult to tell if
two radicals or paraffins are equivalent. And if we ever generate the same paraffin more than
once, we have to discard all but one.

When we introduced the radicals, we noted that we can get a radical by starting with a
paraffin and "breaking off" a hydrogen atom. Similarly, we can generate a paraffin of size n
by starting with a radical of size n and "completing it" by adding a hydrogen atom. Turner's
algorithm actually adds a methyl group ($CH_3$) to a radical that is one size smaller than the
size of the desired paraffin.

A single paraffin can be generated by attaching a methyl group to several distinct radicals.
This has nothing to do with representation. The paraffin in (a) and (b) of Figure 4.3 can only
be generated in one way by this method. Chopping off any methyl group leaves an identical
radical. The paraffin in (c) of Figure 4.3 can be generated in two ways. We can chop off a
methyl group from a long or short "chain" and leave two different radicals. It should be clear
that if we have all radicals of size i - 1, we can generate all paraffins of size i by attaching
methyl groups to the radicals.

Now we have a problem. We can generate all paraffins of a given size, but we will have
repetitions. And since the representations are not unique, it is non-trivial to filter out duplicates.

Turner deals with the non-uniqueness issue by defining equivalence classes for paraffin
representations. Two representations of paraffins are in the same equivalence class if the
paraffins are equivalent.
The three procedures (or laws) rotate, swap, and invert take one representation for a
paraffin and return another representation for the same paraffin. rotate rotates the paraffin's
radicals (the radicals at the top level). swap exchanges the paraffin's first two radicals. invert
shifts the paraffin-origin to the root of the paraffin's first radical. The laws use pattern matching
to destructure their arguments.

typeof laws = list (paraffin -> paraffin);
laws = rotate:swap:invert:nil;

typeof rotate = paraffin -> paraffin;
def rotate (Para a b c d) = Para b c d a;

typeof swap = paraffin -> paraffin;
def swap (Para a b c d) = Para b a c d;

typeof invert = paraffin -> paraffin;
def invert (Para H b c d) = Para H b c d
 |  invert (Para (Rad x y z) b c d) = Para x y z (Rad b c d);

If applied in the correct order, these three laws can take any representation of a given
paraffin into any other representation for the same paraffin. The closure of these laws over
a singleton set containing a paraffin is the set of all representations, i.e., the equivalence class.
This may or may not be obvious.

The  following  procedures  generate  the  equivalence  class  of  a  paraffin  by  taking  the  closure 
under  the  laws.  “++”  is  the  list  append  operator. 

typeof equivclass = paraffin -> (list paraffin);
def equivclass p = closure_under_laws laws (p:nil);

typeof closure_under_laws = (list (paraffin -> paraffin)) ->
                            (list paraffin) -> (list paraffin);
def closure_under_laws laws s = s ++ closure1 laws s s;

typeof closure1 =
    (list (paraffin -> paraffin)) ->
    (list paraffin) -> (list paraffin) -> (list paraffin);
def closure1 laws s t =
  closure2 laws s (mkset {: p || law <- laws
                               & p <- map_list law t
                               unless member? (==) p s});

typeof closure2 =
    (list (paraffin -> paraffin)) ->
    (list paraffin) -> (list paraffin) -> (list paraffin);
def closure2 laws s nil = nil
 |..closure2 laws s t = t ++ closure1 laws (s++t) t;

The mkset procedure (used in closure1 above) removes duplicates from a list.

typeof mkset = (list paraffin) -> (list paraffin);
def mkset nil = nil
 |  mkset (x:xs) = x : {: y || y <- mkset xs unless x==y};

Using equivclass, testing paraffin equivalence is straightforward. The member? library
routine takes an equality test, an element, and a list, and determines if the element is in the
list.

typeof equiv = paraffin -> paraffin -> B;
def equiv a b = member? (==) b (equivclass a);
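The three laws and the closure can be sketched in Python as follows. The sketch replaces the closure1/closure2 pair with a simple worklist over a set, which computes the same closure; the constructor helpers `rad` and `para` and the tuple encoding are hypothetical names used only for this illustration. As a check, the molecule of Figure 4.3 (a) and (b), read from the representation counts above as two inner carbons each carrying two methyl groups and one hydrogen, closes to the 108 representations quoted earlier.

```python
H = "H"

def rad(a, b, c):
    return ("Rad", a, b, c)

def para(a, b, c, d):
    return ("Para", a, b, c, d)

def rotate(p):
    _, a, b, c, d = p
    return para(b, c, d, a)

def swap(p):
    _, a, b, c, d = p
    return para(b, a, c, d)

def invert(p):
    _, a, b, c, d = p
    if a == H:                       # nothing to invert through a hydrogen
        return p
    _, x, y, z = a
    return para(x, y, z, rad(b, c, d))

LAWS = (rotate, swap, invert)

def equivclass(p):
    """All representations reachable from p by repeatedly applying the laws."""
    seen = {p}
    frontier = [p]
    while frontier:
        new = []
        for q in frontier:
            for law in LAWS:
                r = law(q)
                if r not in seen:
                    seen.add(r)
                    new.append(r)
        frontier = new
    return seen

def equiv(a, b):
    return b in equivclass(a)

methyl = rad(H, H, H)
molecule = para(methyl, methyl, H, rad(methyl, methyl, H))
print(len(equivclass(molecule)))
```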

Given  the  ability  to  test  for  equivalence,  the  quotient  procedure  takes  an  equivalence 
relation  and  a  list,  and  returns  a  list  with  no  two  equivalent  elements. 

typeof quotient = (paraffin->paraffin->B) ->
                  (list paraffin) -> (list paraffin);
def quotient f nil = nil
 |  quotient f (a:x) = a : {: b || b <- quotient f x unless f a b};

Finally,  we  can  generate  the  radicals  of  size  n  —  1,  slap  on  methyl  groups,  and  filter  out 
equivalent  representations. 

typeof paragen = N -> (list paraffin);
def paragen n =
  { radicals = rad_array (n-1);
    rads = radicals[n-1]
  in
    quotient equiv {: Para r H H H || r <- rads}};

Turner's solution uses higher order procedures and abstraction in a big way. In case you
didn't notice, the Id# programs in this section have no "#"s in them, i.e., it is not necessary to
use lazy data-structures. The fact that closure_under_laws can be executed eagerly in a lazy
functional language with no penalty is not obvious, though (perhaps it is?). This fact is very
difficult to deduce automatically. We discuss this issue in Section 5.2. The solution does make
heavy use of non-strict data-structures to expose parallelism, however.
If the problem were modified slightly, then laziness would come in handy. If the problem
were, for example, to produce ten paraffins of a particular size, we wouldn't like to produce
all the paraffins of that size, or even all the radicals of smaller sizes. But, how much laziness
is required? It turns out that we need simply convert the list enumerations in radgen (in
rad_array) and in paragen to stream enumeration, and we have a "lazy version". It is still
true that executing closure_under_laws eagerly causes no penalty, but it is even less obvious
and no easier to deduce automatically. We discuss this issue in Section 5.2.

Turner’s  solution  is  exponentially  inefficient.  By  changing  the  laws  a  little  bit,  and  taking 
advantage  of  the  fact  that  canonical  forms  exist,  a  solution  of  the  same  style  (equivalence  class, 
closure,  etc.)  can  be  developed.  For  certain  canonical  forms,  it  turns  out  that  even  more 
efficient  solutions  exist,  solutions  whose  running  time  is  linear  in  the  size  of  the  output  [6]. 

An  Efficient  Paraffins  Program 

This section presents an efficient solution (linear in the size of the output, which makes it
asymptotically optimal) for the paraffins problem. A more detailed description of this solution
can be found in [6]. This solution uses no higher order functions or laziness.

The first step is to establish a total order on radicals. This is fairly easy (since radicals have
a reference point, the root), and leads directly to a canonical form for radicals. A radical is in
canonical form if its sub-radicals are in ascending order, and if the sub-radicals are in canonical
form. We will not give the details of the total ordering and canonical form here. Suffice it to
say that the gen_radicals procedure below generates radicals completely and uniquely. The
tails procedure produces the list of all non-empty suffixes of a list. The radical type is
redefined to memoize the size.


def tails nil = nil
 |  tails (a:as) = (a:as) : tails as;

type radical = H | Rad N radical radical radical;


def gen_radicals w =
  { def rgen wp1 =
      {: Rad wp1 r1 r2 r3 ||
           w = wp1-1
         & w1 <- 0 to fix(w/3)
         & r1:r1tl <- tails radicals[w1]
         & w2 <- w1 to fix((w-w1)/2)
         & r2:r2tl <- tails (if w1<w2 then radicals[w2] else r1:r1tl)
         & w3 = w-w1-w2
         & r3 <- if w2<w3 then radicals[w3] else r2:r2tl};
    radicals = { array (0,w)
                 | [0] = H : nil
                 | [i] = rgen i || i <- 1 to w}
  in radicals};

Now we need a canonical form for paraffins. The graphs that model paraffins are called free
trees. Knuth (Volume I) discusses the enumeration (i.e., unique enumeration) of free trees [28].
The trick is to realize that free trees (paraffins) have a well defined centroid, which is either one
node or a pair of adjacent nodes. The centroid is like the center of mass. We define two types
of paraffins, BCPs (bond centered paraffins) and CCPs (carbon centered paraffins) corresponding
to the two cases.


type paraffin = BCP radical radical |
                CCP radical radical radical radical;

BCPs  and  CCPs  partition  the  paraffins.  Canonical  form  for  paraffins  is  achieved  if  the  radicals 
of  BCPs  and  CCPs  are  kept  in  order  and  in  canonical  form. 

The  paragen  procedure  generates  the  paraffins  completely  and  uniquely,  bcpgen  generates 
the  bond  centered  paraffins,  and  ccpgen  generates  the  carbon  centered  paraffins. 


def paragen radicals n =
{ def bcpgen w = if (0 <> remainder w 2) then nil
                 else {: BCP r1 r2 ||
                         r1:r1tl <- tails radicals[fix(w/2)]
                       & r2 <- r1:r1tl};
  def ccpgen w =
    {: CCP r1 r2 r3 r4 ||
       w1 <- 0 to fix((w-1)/4)
     & r1:r1tl <- tails radicals[w1]
     & w2 <- w1 to fix((w-1-w1)/3)
     & r2:r2tl <- tails (if w1<w2 then radicals[w2] else r1:r1tl)
     & w3min = (fix (w/2))-w1-w2
     & w3lo = max w2 w3min
     & w3 <- w3lo to fix((w-1-w1-w2)/2)
     & r3:r3tl <- tails (if w2<w3 then radicals[w3] else r2:r2tl)
     & w4 = w-1-w1-w2-w3
     & r4 <- if w3<w4 then radicals[w4]
             else r3:r3tl}
in bcpgen n, ccpgen n};

This solution does not require any laziness or non-strictness. In fact, it can be translated into
Fortran in a straightforward way. Extracting parallelism from the Fortran, however, would be
very  difficult.  Consider  the  gen_radicals  procedure,  which  enumerates  the  radicals  of  sizes  0  to 
n.  While  the  procedure  can  be  executed  in  well  defined  phases,  strict  semantics  would  severely 
limit  the  available  parallelism.  (We  explore  this  in  Section  5.1.)  Id’s  non-strict  semantics, 
however,  harnesses  all  the  parallelism  available.  This  is  an  excellent  example  of  where  laziness 
provides  non-strictness,  but  all  that  is  needed  is  the  less  restrictive  and  cheaper  non-strictness. 
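
To make the enumeration concrete, the following Python sketch transcribes tails, gen_radicals, and paragen as plain nested loops. It is an illustration only, not the Id program: the tuple encodings ("H",), ("Rad", ...), ("BCP", ...), and ("CCP", ...) are assumptions of the sketch, fix is taken to be integer floor division, and the list slicing in tails gives up the linear-time bound of the original.

```python
def tails(xs):
    """Yield every non-empty suffix of xs, longest first."""
    for i in range(len(xs)):
        yield xs[i:]

H = ("H",)  # the hydrogen radical (size 0); this encoding is an assumption

def gen_radicals(n):
    """Return a list where entry k holds the canonical radicals with k carbons."""
    radicals = [[H]]
    for size in range(1, n + 1):
        w = size - 1  # carbons left over for the three sub-radicals
        found = []
        for w1 in range(0, w // 3 + 1):
            for t1 in tails(radicals[w1]):
                for w2 in range(w1, (w - w1) // 2 + 1):
                    for t2 in tails(radicals[w2] if w1 < w2 else t1):
                        w3 = w - w1 - w2
                        for r3 in (radicals[w3] if w2 < w3 else t2):
                            found.append(("Rad", size, t1[0], t2[0], r3))
        radicals.append(found)
    return radicals

def paragen(radicals, n):
    """Return (bond centered, carbon centered) paraffins with n carbons, n >= 1."""
    bcps = []
    if n % 2 == 0:
        for t1 in tails(radicals[n // 2]):
            for r2 in t1:
                bcps.append(("BCP", t1[0], r2))
    ccps = []
    for w1 in range(0, (n - 1) // 4 + 1):
        for t1 in tails(radicals[w1]):
            for w2 in range(w1, (n - 1 - w1) // 3 + 1):
                for t2 in tails(radicals[w2] if w1 < w2 else t1):
                    w3lo = max(w2, n // 2 - w1 - w2)
                    for w3 in range(w3lo, (n - 1 - w1 - w2) // 2 + 1):
                        for t3 in tails(radicals[w3] if w2 < w3 else t2):
                            w4 = n - 1 - w1 - w2 - w3
                            for r4 in (radicals[w4] if w3 < w4 else t3):
                                ccps.append(("CCP", t1[0], t2[0], t3[0], r4))
    return bcps, ccps
```

Drawing each later sub-radical from the suffix that starts at the previous one is what enforces ascending order among sub-radicals of equal size, so each multiset of sub-radicals is produced exactly once.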

If  we  took  up  the  modification  to  Turner’s  original  problem  statement  as  presented  at  the  end 
of  the  last  section,  it  is  clear  that  we  can  get  a  “lazy  version”  by  converting  the  list  enumerations 
of  rgen,  bcpgen,  and  ccpgen  to  stream  enumeration,  and  we  can  generate  paraffins  one  at  a 
time. 


4.2 What We Cannot Do with Lazy Data-Structures

As we mentioned, our system is not as expressive as a lazy functional language. In this section
we explore programs that are difficult or impossible to code using lazy data-structures.

As  pointed  out  by  Henderson  [24],  laziness  cannot  always  be  restricted  to  data  structures  if 
we  wish  to  achieve  lazy  semantics.  The  following  program  relies  on  laziness  in  an  argument  to 
a  procedure: 

typeof problem = N -> (list N);

def problem n = 1 to n ++ problem (n+1);

The right argument to append (“++”) must be delayed to break an infinite recursion. It
is  straightforward,  however,  to  produce  a  program  that  functions  correctly  in  our  restricted 
system. 


typeof no_problem = N -> (list N);
def no_problem n =
{ def lazy_range lo hi =
    if lo == hi then hi:nil
    else lo :# lazy_range (lo+1) hi;
  def no_problem_ n = lazy_range 1 n :# no_problem_ (n+1);
  def flatten_l (nil:rest) = flatten_l rest
   | flatten_l ((a:as):rest) = a :# flatten_l (as:rest);
in
  flatten_l (no_problem_ n)};

Suppose we call no_problem with an input of 1. Thunks will be associated with:

1. each element of each intermediate list generated by lazy_range,

2. each intermediate list as a whole (“:#” in no_problem_),

3. and each element of the result (“:#” in flatten_l).

Further suppose that the result contains x elements. We will generate $2x + \sqrt{x}$ thunks. A
compiler  that  did  a  perfect  job  at  analyzing  problem  would  generate  code  that  would  generate 
the  same  number  of  thunks. 

The  theoretical  minimum  is  one  thunk  per  output  element,  as  we  may  suspend  any  time. 
We  can  achieve  this  in  our  language  by  removing  abstraction. 

typeof no_abstraction = N -> (list N);
def no_abstraction n =
{ def no_abstraction_ lo hi =
    { lo1,hi1 = if lo == hi then 1,(hi+1) else (lo+1),hi
    in
      lo :# no_abstraction_ lo1 hi1}
in
  no_abstraction_ 1 n};

If we are willing to allow some unwinding, i.e., some speculation (generating several elements
at a time), we can avoid even more thunks. We can generate each intermediate list eagerly, and
end up with $\sqrt{x}$ thunks for an output stream of length x.

typeof speculation = N -> (list N);
def speculation n =
{ def speculation_ lo hi =
    if lo == hi then hi :# speculation_ 1 (hi+1)
    else lo : speculation_ (lo+1) hi;
in
  speculation_ 1 n};
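
The thunk counts for the last two versions can be checked with a small model. The Python sketch below is an assumption-laden stand-in for the Id programs: Cell plays the role of a list cell, the lazy_tail argument models “:#”, the tail argument models the eager “:”, and the class counter thunks_created is bookkeeping added only for this illustration.

```python
class Cell:
    """A list cell whose tail is either a plain value or an unevaluated thunk."""
    thunks_created = 0  # bookkeeping for the illustration only

    def __init__(self, head, tail=None, lazy_tail=None):
        self.head = head
        self._tail = tail
        self._thunk = lazy_tail
        if lazy_tail is not None:
            Cell.thunks_created += 1

    @property
    def tail(self):
        if self._thunk is not None:
            # Force once, then keep the value so later reads pay nothing.
            self._tail, self._thunk = self._thunk(), None
        return self._tail

def no_abstraction(n):
    def go(lo, hi):
        lo1, hi1 = (1, hi + 1) if lo == hi else (lo + 1, hi)
        return Cell(lo, lazy_tail=lambda: go(lo1, hi1))  # lazy cons
    return go(1, n)

def speculation(n):
    def go(lo, hi):
        if lo == hi:
            return Cell(hi, lazy_tail=lambda: go(1, hi + 1))  # lazy cons
        return Cell(lo, tail=go(lo + 1, hi))                  # eager cons
    return go(1, n)

def take(k, cell):
    """Read the first k heads, forcing a tail only when another head is needed."""
    out = []
    while k > 0:
        out.append(cell.head)
        k -= 1
        if k > 0:
            cell = cell.tail
    return out
```

In this model, reading ten elements of no_abstraction(1) creates ten thunks (one per element), while speculation(1) creates four (one per run 1..hi), at the price of building each run before it is demanded.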

In the first solution, although there is a lot of abstraction, we coded all the routines (including
the library append function) down to primitives. In the latter solutions, we avoided
abstraction, and solved the problem more efficiently. Abstraction is interfering with our lazy
data-structuring mechanism. Abstraction is the hallmark of functional languages, and any
limitation on our ability to abstract is serious.

4.2.1 Lay


A real program with the aforementioned difficulty can be found in the Miranda Standard
Environment.  The  lay  procedure,  when  applied  to  a  list  of  strings,  joins  them  together  after 
appending  a  newline  character  to  each  string.  The  result  is  a  string.  Since  Miranda  defines 
strings  as  lists  (at  some  deep  level),  strings  can  be  partially  defined.  The  desired  behavior 
is  to  avoid  computing  the  result  string  fully  unless  the  entire  string  is  needed.  Consider  the 
translation  of  the  Miranda  version  into  Id: 

typeof lay = (list S) -> S;
def lay nil = nil
 | lay (a:x) = string_concat a (string_concat "\n" (lay x));

The desired lazy behavior can be achieved by rewriting the library routine string_concat.
This is a problem, since we don’t want the user altering library functions.
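
The behavior we want from lay can be illustrated outside Id. In the Python sketch below a generator stands in for a partially defined string; this is an analogy under that assumption, not a proposal for rewriting string_concat, and the names numbered and first are hypothetical.

```python
from itertools import count, islice

def lay(lines):
    """Lazily yield the characters of each line, each followed by a newline."""
    for line in lines:
        yield from line
        yield "\n"

# An unbounded supply of strings: only the demanded prefix is ever computed.
numbered = (f"line {i}" for i in count(1))
first = "".join(islice(lay(numbered), 14))
```

Because characters are produced on demand, the result can be consumed even though the input list of strings never ends.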

4.2.2  Equal  Fringe 

The  problem  of  comparing  the  fringes  of  two  trees  for  equality  is  a  famous  demonstration 
of  the  power  of  laziness  with  respect  to  non-infinite  data-structures.  The  following  code  lazily 
converts  the  trees’  fringes  into  streams,  and  then  compares  the  stream  elements  until  a  difference 
is  encountered. 

The  trees  being  compared  may  be  lazily  constructed  or  not,  but  at  least  part  of  the  fringe 
must  be  reachable  in  finite  time. 

type tree = Leaf N | Node tree tree;

typeof equal_fringe = tree -> tree -> B;

def equal_fringe t1 t2 = equal_list (fringe t1) (fringe t2);

typeof equal_list = (list N) -> (list N) -> B;
def equal_list nil nil = true
 | equal_list (a:as) nil = false
 | equal_list nil (b:bs) = false
 | equal_list (a:as) (b:bs) = if a==b then equal_list as bs
                              else false;

We still need to generate the fringe. The following program, which works in a lazy functional
language, relies on the lazy evaluation of a procedure argument, as did Henderson’s example.
Furthermore, it cannot be trivially modified to run correctly in our system.

typeof fringe = tree -> (list N);
def fringe t = fringe_ t nil;

typeof fringe_ = tree -> (list N) -> (list N);
def fringe_ (Leaf x) tail = x:tail
 | fringe_ (Node t1 t2) tail = fringe_ t1 (fringe_ t2 tail);

The problem is that we need to delay the second argument to the fringe_ procedure. Our
trick of changing “:” to “:#” won’t help us here.

Fortunately,  there  are  other  ways  to  fringe  a  tree.  Here  are  two.  The  first  shifts  the  tree  to 
the  right,  preserving  the  fringe,  until  the  left  edge  of  the  fringe  is  near  the  top. 


typeof fringe1 = tree -> (list N);
def fringe1 (Leaf x) = x:nil
 | fringe1 (Node t1 t2) = fringe1_ t1 t2;

typeof fringe1_ = tree -> tree -> (list N);
def fringe1_ (Leaf x) t = x :# fringe1 t
 | fringe1_ (Node t1 t2) t3 = fringe1_ t1 (Node t2 t3);

Or  we  can  simply  keep  track  of  the  spine  along  which  we  descended. 

typeof fringe2 = tree -> (list N);
def fringe2 t = fringe2_ t nil;

typeof fringe2_ = tree -> (list tree) -> (list N);
def fringe2_ (Leaf x) nil = x:nil
 | fringe2_ (Leaf x) (t:ts) = x :# fringe2_ t ts
 | fringe2_ (Node t1 t2) ts = fringe2_ t1 (t2:ts);

It is interesting to compare the different solutions. The first recursive call to fringe_ needs
a  delayed  argument,  but  the  second  recursive  call  does  not.  If  a  compiler  only  compiles  one 
version  of  a  procedure,  no  amount  of  strictness  analysis  will  optimize  this  away.  The  laziest 
(and  most  expensive)  case  must  be  applied  universally. 

Continuing with the fringe_ procedure, since the traversal requires one procedure call for
each node in the graph, there is a delayed and subsequently forced expression for both the
internal  and  fringe  nodes.  So,  on  the  average,  there  are  two  delays  and  two  forces  for  each 
fringe  element  generated. 

Now consider either of the other solutions. There is exactly one force and one delay per
fringe element generated, the theoretical minimum.

The point is not investigating clever algorithms per se. The original algorithm is slick, but
surprisingly expensive. The other algorithms delay only what is necessary.
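
The spine-tracking idea of fringe2 carries over directly to any language with on-demand sequences. The Python sketch below is an illustration under assumed encodings (a leaf is any non-tuple value, an internal node is a pair); the generator plays the role of the lazily consed result, and the explicit stack is the list of pending right subtrees.

```python
from itertools import zip_longest

def fringe(tree):
    """Yield the leaves left to right, one at a time."""
    stack = [tree]
    while stack:
        node = stack.pop()
        while isinstance(node, tuple):   # internal node: (left, right)
            left, right = node
            stack.append(right)          # remember the spine we descended
            node = left
        yield node

def equal_fringe(t1, t2):
    """Compare fringes element by element, stopping at the first difference."""
    missing = object()  # sentinel for a fringe that ends early
    pairs = zip_longest(fringe(t1), fringe(t2), fillvalue=missing)
    return all(a == b for a, b in pairs)
```

Each leaf is produced by resuming the generator once, which corresponds to one delay and one force per fringe element in the discussion above.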

4.2.3  The  Non-Strict  Tree 

The  following  example  is  drawn  from  [26]  where  it  is  credited  to  Bird  [12].  In  a  single  traversal,  it 
replaces  each  element  of  the  fringe  by  the  minimum  element  of  the  original  fringe.  Although  this 
program  is  featured  as  depending  on  lazy  evaluation,  it  only  requires  non-strict  evaluation  [35]. 
A  new  variety  of  tree  is  used  that  has  values  at  the  leaves  but  not  at  the  internal  nodes. 

type ntree = tip N | fork ntree ntree;

typeof traverse = ntree -> N -> (ntree, N);
def traverse (tip n) x = (tip x), n
 | traverse (fork L R) x = { L1,xL = traverse L x;
                             R1,xR = traverse R x;
                             x1 = min xL xR
                           in
                             (fork L1 R1), x1};

typeof solution_2 = ntree -> ntree;
def solution_2 t = { t1,x = traverse t x
                   in
                     t1};

It is interesting to ask if we can produce the result lazily in our language, and the answer
is: not very easily. Again, the problem is abstraction over data-structure construction. But
no piece of the answer can be produced before the entire input is traversed. So, if the result
is produced lazily, we are effectively doing two passes anyway, and much more straightforward
algorithms are possible.
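
The non-strict character of this program can be mimicked in a strict language with an explicit placeholder. In the Python sketch below, which is an analogy and not the Id semantics, a one-element list named box is an assumed stand-in for the not-yet-defined variable x: every new leaf refers to the box during the single traversal, and the box is filled once the minimum is known.

```python
def traverse(tree, box):
    """Return (new_tree, minimum); every new leaf is the shared box."""
    if isinstance(tree, tuple):          # fork: (left, right)
        l1, x_l = traverse(tree[0], box)
        r1, x_r = traverse(tree[1], box)
        return (l1, r1), min(x_l, x_r)
    return box, tree                     # tip: the leaf value is tree itself

def solution_2(tree):
    box = []                             # placeholder for the unknown minimum
    t1, x = traverse(tree, box)
    box.append(x)                        # define the value after the traversal
    return t1

def leaves(tree):
    """Read the result fringe, looking inside the box at each tip."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree[0]]
```

The result tree is fully built before the minimum is available, which is exactly what non-strictness permits and what strict construction forbids.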

4.3  Power  and  Limitations  of  Lazy  Data-Structures 

We  have  considered  a  host  of  examples  to  gain  experience  with  the  power  and  limitations  of  our 
system.  We  conclude  with  a  discussion  of  the  power  and  limitations  of  lazy  data-structures. 

4.3.1  More  Expressive  Power 

We can now write programs with streams and other lazy data-structures, an ability not previously
available in Id. Lazy arrays are a particularly dramatic example, as they provide a
power in excess of that available in a lazy functional language. This is somewhat slippery, as
arrays are not typically available in functional languages. Unless array generation is extremely
restricted, it is not possible to propagate demands through an array, as index functions are not
always available or invertible. Consider the following lazy array comprehension:

{ array (1,n) | [perm i] # f i || i <- 1 to n}

In  order  to  compute  a  slot  value,  the  producer  must  be  determined.  But  this  cannot  be 
done  without  at  least  some  evaluation  of  the  index  function.  In  general,  we  will  have  to  evaluate 
the  index  function  until  it  generates  the  desired  value.  There  is  no  other  way  to  “propagate 
demand”  unless  we  can  invert  the  index  function. 

In  our  language,  no  pretense  is  made  about  laziness  with  respect  to  the  index  calculation: 
the  index  calculations  are  performed  eagerly,  determining  the  producer  of  each  slot,  but  the 
value  expressions  are  left  unevaluated.  Work  is  done  to  put  a  thunk  in  every  lazy  slot,  and,  in 
that  way,  the  producer  of  the  value  for  each  slot  is  manifest.  Of  course,  some  slots  can  be  filled 
eagerly,  some  lazily,  and  some  not  at  all. 
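
The division of labor described here (eager index calculation, delayed value expressions) can be sketched as follows. The Python class LazyArray and the names calls, f, and arr are hypothetical and exist only for this illustration; the tagged tuples are an assumed stand-in for the thunk and value states of a slot.

```python
class LazyArray:
    """Slots are located eagerly; their values are computed on first read."""

    def __init__(self, lo, hi, index_fn, value_fn):
        self._slots = {}
        for i in range(lo, hi + 1):
            # The index function runs now, so each slot's producer is manifest.
            # The value expression is only wrapped in a thunk.
            self._slots[index_fn(i)] = ("thunk", lambda i=i: value_fn(i))

    def __getitem__(self, k):
        tag, payload = self._slots[k]
        if tag == "thunk":
            payload = payload()                 # force on first read
            self._slots[k] = ("value", payload)  # later reads see a plain value
        return payload

calls = []  # records which value expressions have actually run

def f(i):
    calls.append(i)
    return i * i

# perm i = 6 - i reverses the index range 1..5.
arr = LazyArray(1, 5, lambda i: 6 - i, f)
```

Constructing arr evaluates the index function five times but runs no value expression; reading arr[5] runs f 1 once, and repeated reads of the same slot do no further work.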

4.3.2  Lack  of  Fidelity 

What  you  see  is  what  you  get.  What  you  meant,  on  the  other  hand,  may  be  somewhat  different. 
For  example,  we  must  be  careful  to  make  a  distinction  between  the  following  two  expressions. 
The  first  one  forces  the  potentially  delayed  tail  of  s  and  then  includes  it  in  a  delayed  expression, 
perhaps  too  late  to  delay.  The  second  one  does  not  risk  any  premature  forcing. 

{(h:t)  =  s  in  h  :#  t} 

{(h:_) = s in h} :# {(_:t) = s in t}

The  second  form,  which  is  more  conservative,  is  clumsy.  This  subtlety  arises  in  practice. 
Consider  the  following  function  to  filter  a  stream: 

typeof sfilter = (*0 -> B) -> (list *0) -> (list *0);
def sfilter p nil = nil
 | sfilter p (x:xs) = if p x then x :# sfilter p xs
                      else sfilter p xs;

Even though xs is named in both branches of the conditional, the consequent is a stream
cons and might never be evaluated. But, as defined, xs always gets a value. A more conservative
definition would be the following:

def sfilter p nil = nil
 | sfilter p s =
     { (x:_) = s
     in
       if p x then x :# sfilter p {_:xs = s in xs}
       else sfilter p {_:xs = s in xs}};

So, our original definition looked fine, but it did a little more work. This is not normally a
problem where we are concerned with controlling infinite streams, but it does extra computation,
and might not be safe.

The following program has a similar difficulty and demonstrates it precisely. We
would like a program that touches the first n elements of a stream, but the following, contrary
to our intuition, touches n+1 elements:

typeof take = N -> (list *0) -> (list *0);
def take 0 s = nil
 | take n (x:xs) = x : take (n-1) xs;

It  has  been  suggested  that  this  problem  is  due  to  the  way  we  compile  pattern  matching  [36]. 
We  could  move  the  selection  of  slots  “inward”  to  the  lexical  use,  so  that  they  are  not  selected 
in  as  many  cases.  This  idea  seems  right  and  merits  additional  study. 
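
The off-by-one behavior of take, and the effect of moving slot selection inward, can be reproduced in a small model. In the Python sketch below the class Stream, its counter cells_built, and the two take variants are assumptions made for illustration; selecting the tail slot forces it, just as the pattern (x:xs) does in the compiled Id code.

```python
class Stream:
    """A stream cell with an eager head and a delayed, memoized tail."""
    cells_built = 0  # bookkeeping for the illustration only

    def __init__(self, head, tail_thunk):
        Stream.cells_built += 1
        self.head = head
        self._tail_thunk = tail_thunk
        self._tail = None

    @property
    def tail(self):
        if self._tail is None:
            self._tail = self._tail_thunk()  # force and remember
        return self._tail

def ints_from(n):
    return Stream(n, lambda: ints_from(n + 1))

def take_eager_match(n, s):
    if n == 0:
        return []
    x, xs = s.head, s.tail        # both slots selected up front, as in (x:xs)
    return [x] + take_eager_match(n - 1, xs)

def take_conservative(n, s):
    if n == 0:
        return []
    if n == 1:
        return [s.head]           # the tail slot is never selected
    return [s.head] + take_conservative(n - 1, s.tail)
```

Taking three elements with take_eager_match builds four cells, while take_conservative builds three; both return the same list.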

4.3.3  Limitations  of  the  Lazy  Data-Structures 

Our system is not as expressive as a lazy system. There are two ways that we see the difference.
First, arguments to procedures are not delayed. And second, we cannot abstract over lazy
constructors and preserve laziness. The second point, while subsumed by the first, is independently
interesting.

We  provide  no  mechanism  for  abstracting  over  lazy  data-structures.  This  is  analogous  to 
the  problem  present  in  eager  languages  of  abstracting  over  conditionals.  Consider  the  following 
user-defined  conditional  (equivalent  to  Lisp’s  and  function,  which  is  also  sequential): 

typeof if_nil = (list *0) -> (list *1) -> (list *1);
def if_nil lst val = if nil == lst then nil else val;

According to eager semantics, the actual expression for val would be evaluated before
if_nil were applied. All applications (under an eager interpreter) of the following function for
making a copy of a list would run off the end of the list and produce an error. A lazy system
would interpret the program as desired.

typeof copy_list = (list *0) -> (list *0);

def copy_list lst = if_nil lst {(a:as) = lst in a : copy_list as};

This  inability  to  abstract  over  conditionals  is  analogous  to  our  inability  to  abstract  over 
lazy  data-constructors.  By  the  time  an  expression  reaches  a  lazy  data-constructor,  it  is  already 
evaluated. 
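
Python is strict in procedure arguments, so it exhibits the same failure and the same explicit remedy. The sketch below is an illustration only; the delayed variants, which pass the alternative as a zero-argument function, are an assumption of the sketch and correspond to wrapping the argument in a thunk by hand.

```python
def if_nil(lst, val):
    return [] if not lst else val

def copy_list_eager(lst):
    # The second argument is evaluated before if_nil is applied, so the
    # recursion runs off the end of the list and indexing [] fails.
    return if_nil(lst, [lst[0]] + copy_list_eager(lst[1:]))

def if_nil_delayed(lst, val_thunk):
    return [] if not lst else val_thunk()

def copy_list_delayed(lst):
    # The alternative is evaluated only when the list is non-empty.
    return if_nil_delayed(lst, lambda: [lst[0]] + copy_list_delayed(lst[1:]))
```

Every call of copy_list_eager raises IndexError, whatever the input, whereas copy_list_delayed copies the list as intended.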

4.4  Summary 

This chapter demonstrates the utility of our approach as well as the limitations through a
collection  of  examples.  We  are  able  to  deal  with  stream  programming  to  a  large  extent,  and 
notice  that  new  library  routines  are  needed.  As  we  tackle  more  complex  problems,  this  deficiency 
emerges  as  a  difficulty  with  abstracting  over  lazy  assignment  in  general. 

If  we  can  solve  a  problem  in  Id^^,  we  end  up  with  fine  grain  control  over  what  is  lazy  and 
what  is  not.  Many  of  the  classical  examples  can  be  solved  in  Id^^.  We  hypothesize  that  this  is 
due  primarily  to  the  following  two  reasons.  Laziness  is  often  used  to  achieve  non-strict  behavior. 
Our  system  already  has  non-strict  behavior,  obviating  this  use  for  laziness.  Also,  laziness  is 
typically associated with data structures.

Chapter 5

Conclusions 


We have presented an extension to the dataflow language Id for supporting lazy data-structures,
as well as extensions to the compiler and architecture. In this, the concluding
chapter,  we  present  a  demonstration  of  the  cost  of  data  structure  strictness,  a  challenge  for  the 
advocates  of  lazy  functional  languages,  and  concluding  discussions. 

5.1  A  Demonstration:  the  Cost  of  Data  Structure  Strictness 

Although  conventional  languages  are  strict  in  procedure  arguments,  data  structures  may  be 
passed around before they are completely defined, i.e., non-strictly, and synchronization between
producer and consumer is done explicitly. In a “conventional multiprocessor”, both for
efficiency  and  manageability,  synchronization  is  likely  performed  on  a  large  grain  basis.  For 
example,  a  vector  may  be  made  readable  only  after  it  is  completely  defined.  Or,  a  matrix  may 
be  made  available  one  row  at  a  time.  Individual  synchronization  can  be  viewed  as  enforcing  a 
strictness  constraint  equivalent  to  data  availability.  And,  blocked  synchronization  is  analogous 
to  building  entire  data-structures  strictly;  all  the  data  must  be  in  place  before  any  values  are 
made  available. 

Data  structure  strictness,  however,  is  not  free.  We  demonstrate  the  cost  of  the  explicit 
synchronization  of  data-structures  with  some  examples. 

5.1.1  Paraffins 

Consider the algorithm of Section 4.1.3 for efficiently enumerating the paraffins (molecules with
structural formula $C_nH_{2n+2}$). The radicals ($C_nH_{2n+1}$) are defined using complete induction.
As a group, the radicals of size n (n carbon atoms) depend on the radicals of sizes 0 to n - 1.
The following Id code allows the radicals to be computed with or without barriers between the
production of radicals of successive sizes. A barrier ensures that the preceding phase completes
before the succeeding phase begins.

type radical = H | Rad N radical radical radical;

typeof traverse_radical = radical -> N;
def traverse_radical H = 0
 | traverse_radical (Rad n r1 r2 r3) =
     { tr1 = traverse_radical r1;
       tr2 = traverse_radical r2;
       tr3 = traverse_radical r3
     in
       if (strict n) then (tr1 + tr2 + tr3) else 0};

typeof traverse_radical_list = (list radical) -> N;
def traverse_radical_list nil = 0
 | traverse_radical_list (r:x) =
     1 + (traverse_radical r) + (traverse_radical_list x);

typeof traverse_radical_lists = (array (list radical)) -> (array N);
def traverse_radical_lists radicals =
{ (lo,hi) = bounds radicals
in
  { array (lo,hi)
  | [i] = traverse_radical_list radicals[i]
  || i <- lo to hi}};

traverse_radical_lists,  which  traverses  each  list  of  radicals  and  returns  an  array  of  the 
number of radicals of every size, implements the barrier. All operations associated with any of
the  traverse  procedures,  however,  are  masked  from  the  collected  statistics. 

The gen_rads_bar procedure defines the array, strict_array; the ith slot becomes defined
when all the radicals of size i are fully defined. If the extra argument barrier? to gen_rads_bar
is true, radicals of size n - 1 are completely defined before any radicals of size n are derived.

typeof gen_rads_bar = N -> B -> (array N);
def gen_rads_bar w barrier? =
{ def rgen wp1 gate1 =
    {: Rad wp1 r1 r2 r3 ||
       w = wp1-1
     & w1 <- fix(0*gate1) to fix(w/3)
     & r1:r1tl <- tails radicals[w1]
     & w2 <- w1 to fix((w-w1)/2)
     & r2:r2tl <- tails (if w1<w2 then radicals[w2] else r1:r1tl)
     & w3 = w-w1-w2
     & r3 <- if w2<w3 then radicals[w3] else r2:r2tl};
  radicals =
    { array (0,w)
    | [0] = H : nil
    | [i] = rgen i strict_array[if barrier? then i-1 else 0]
    || i <- 1 to w};
  strict_array = traverse_radical_lists radicals
in strict_array};

The parallelism profile¹ in Figure 5.1 describes the construction of the radicals of size eight
or less, where the construction of any radicals of size n does not begin until all radicals of size
n - 1 are completed. Barriers are present between the construction of radicals of different sizes.
The domain is time steps, and the range is the number of parallel ALU operations.

[Plot: ALU operations profile in gen_rads_bar (8 $T)]

Figure  5.1:  Parallelism  Profile  for  Strict  Generation  of  Radicals 

The  parallelism  profile  in  Figure  5.2  describes  the  construction  of  the  radicals  of  size  eight 
or  less,  where  the  construction  of  radicals  of  all  sizes  begins  as  soon  as  possible  in  an  overlapped 
and non-strict fashion.

In  the  strict  case,  15,805  operations  are  performed,  and  the  critical  path  is  1788.  In  the 
non-strict  case,  15,805  operations  are  performed,  and  the  critical  path  is  705.  In  the  original 
algorithm,  15,708  operations  are  performed,  and  the  critical  path  is  681.  The  extra  operations 
are the result of the procedure linkage to traverse_radical_lists, and the conditional barrier.
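
A derived figure helps in reading these numbers: dividing the operation count by the critical path gives the average parallelism of each run. The Python sketch below only performs that arithmetic on the counts reported above; the ratio itself is our own summary and the names runs, avg_parallelism, and path_ratio are hypothetical.

```python
# (operations, critical path) as reported in the text for radicals of size <= 8.
runs = {
    "strict (barriers)": (15805, 1788),
    "non-strict": (15805, 705),
    "original algorithm": (15708, 681),
}

# Average parallelism = total operations / length of the critical path.
avg_parallelism = {name: ops / path for name, (ops, path) in runs.items()}

# How much longer the critical path becomes when barriers are imposed.
path_ratio = 1788 / 705
```

The strict run averages roughly 8.8 operations per step against roughly 22.4 for the non-strict run, and the barriers lengthen the critical path by a factor of about 2.5 for the same amount of work.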

¹ A parallelism profile plots the number of operations that can be executed in parallel for a particular program
and input under a greedy schedule. The two curves envelop the actual locus.

[Plot: ALU operations profile in gen_rads_bar (8 $F)]

Figure 5.2: Parallelism Profile for Non-Strict Generation of Radicals

The actual operations associated with the traversal procedures are masked from the statistics.

Strictness  implies  serialization  and  costs  us  parallelism.  The  non-strict  case  has  a  shorter 
critical  path  and  more  parallelism.  The  shape  of  the  parallelism  profile  is  important,  too.  The 
many  constrictions  make  it  difficult  to  keep  a  parallel  machine  busy.  Also,  synchronization 
must be performed explicitly; the masked operations associated with traverse_radical_lists
accounted  for  the  majority  of  the  operations. 

5.1.2  Insertion  Sort 

In this section we consider a functional implementation of insertion sort. A list is sorted by
successively inserting each element into a sorted list. insert_elements calls insert_element
to insert each element of the original list into a sorted intermediate result. insertion_sort
calls insert_elements with the initial list and an empty list (which is vacuously sorted).

typeof insertion_sort = (list N) -> (list N);
def insertion_sort as = insert_elements as nil;

typeof insert_elements = (list N) -> (list N) -> (list N);
def insert_elements nil sorted_list = sorted_list
 | insert_elements (a:as) sorted_list =
     insert_elements as (insert_element a sorted_list);

typeof insert_element = N -> (list N) -> (list N);
def insert_element a nil = a:nil
 | insert_element a (x:xs) =
     if a < x then
       a:x:xs
     else
       x : insert_element a xs;

If  we  enforce  data-structure  strictness,  insertions  proceed  one  element  at  a  time.  The  next 
insertion  cannot  begin  until  the  previous  one  has  completed.  The  synchronization,  however,  is 
simpler  than  in  the  paraffins  example,  as  the  list  is  traversed  sequentially.  So,  when  we  reach 
the  correct  position,  the  prefix  must  be  defined.  Or  must  it?  Doesn’t  this  inductive  fact  simplify 
synchronization?  Not  necessarily.  Unless  we  enforce  extraneous  synchronization  along  the  way, 
the  “heads”  of  the  prefix  cells  could  be  proceeding  in  parallel.  So,  we  must  synchronize  the 
entire  list. 

A non-strict approach allows the successive elements to be inserted in a pipelined and
overlapped fashion. As soon as a prefix is computed, it is returned, partially defined, and the
next element can begin its crawl down the list.
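
The pipelined structure can be modeled with chained generators, each insertion being one pipeline stage that emits a prefix element as soon as it is known. The Python sketch below is a sequential illustration of the data dependencies only; it assumes generators as a stand-in for partially defined lists and does not model parallel execution.

```python
def insert_element(a, xs):
    """Yield xs with a inserted before the first element greater than a."""
    it = iter(xs)
    for x in it:
        if a < x:
            yield a
            yield x
            yield from it   # the rest of the list passes through unchanged
            return
        yield x             # prefix element is released immediately
    yield a                 # a belongs at the end (or xs was empty)

def insertion_sort(items):
    """Chain one insertion stage per element, starting from the empty list."""
    result = iter(())
    for a in items:
        result = insert_element(a, result)
    return result
```

No stage waits for the previous insertion to finish: a stage asks its predecessor for one element at a time, so the first output element can appear while later positions are still undetermined.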

What  if  we  destructively  update  a  shared  list  that  has  some  method  for  locking?  Although 
Id  has  no  facilities  for  this  destructive  behavior,  we  hypothesize  the  results  for  a  parallel  system 
that  allows  updates  such  as  Halstead’s  Multilisp  on  Concert  [21,  22].  The  same  mechanism 
that is used for locking may subsume our producer-consumer synchronization. After all, that
is  what  locking  and  exclusive  access  are  all  about. 

5.1.3  Discussion 

Two  points  merit  exposition: 

•  Strictness  limits  parallelism,  as  it  implies  sequentialization. 

•  Synchronization  is  expensive.  Fine  grain  synchronization  of  data  can  be  accomplished 
using  I-structure  memory.  Large  grain  synchronization,  which  implies  the  coordination  of 
a  set  of  fine  grain  synchronizations,  is  more  complex  in  an  unordered  environment.  In 
a SIMD machine this is easy as computation proceeds in lock step, but the synchronization
cost is paid constantly. In a MIMD machine, where flexible evaluation order buys
utilization, coordinated synchronization is expensive.

5.2 A Challenge: The Difficulty of Optimization


Strictness analysis is a compilation technique for deducing information about argument
strictness. If a procedure is strict in an argument, that argument need not be passed lazily. Abstract
interpretation  [32,  16,  25]  and  context  analysis  [46]  are  two  techniques  for  strictness  analysis. 
Since  the  problem  is  unsolvable  in  general,  these  approaches,  as  well  as  any  other  approaches 
to  strictness  analysis,  are  necessarily  approximation  techniques. 

Consider the closure_under_laws procedure of Section 4.1.3. It produces the equivalence
class of a particular paraffin, i.e., all the legal representations. There is no need for laziness in
this procedure, but this is not easy to show. After a particular paraffin, say $P_0$, is generated, no
equivalent paraffin should be enumerated in the result. Suppose $P_1$ is next in the enumeration.
Before $P_1$ can be produced we must check that it is not in the equivalence class of $P_0$, at which
point the equivalence class of $P_0$ must be fully expanded. Before finding $P_1$, however, we may
have checked $P_2$, $P_3$, and $P_4$, only to discover they were equivalent to $P_0$. Each replica will
have expanded the equivalence class of $P_0$ to include itself. However, at no quiescent point of
the algorithm are any equivalence classes partially expanded.

Why is it difficult to deduce the strictness of closure_under_laws?

1. The strictness must be deduced from context. It is the way that closure_under_laws is
called that leads to the strictness.

2. Only some of the calls to closure_under_laws require the result to be fully defined. It
happens that the algorithm loops until it makes one of these “strict calls”.

Strictness analysis in the presence of higher-order functions and data structures is already
complex. In order to produce efficient compiled code for closure_under_laws, however, we
must have algorithmic insight, a difficult task for a compiler.

5.3  Conclusions 

5.3.1  Thunk  Efficiency 

We have claimed that our mechanism is efficient. In the introductory chapter, we described
several key efficiency issues that are important when we implement delayed computation. Have
we addressed these issues?

1. sharing

2. the use of thunks where lazy evaluation is not necessary

3. the speed of the operations required to manipulate thunks

1.  Sharing  is  accomplished  naturally.  An  expression  is  memoized  in  the  data  structure  in 
which  it  belongs,  and  is  computed  at  most  once. 

2.  Delaying  is  imposed  at  the  request  of  the  programmer,  and,  although  extra  delaying  may 
occur,  it  is  unlikely. 

3.  The  operations  are  efficient.  Creating  a  thunk  is  no  more  than  storing  the  environment, 
and  evaluating  a  delayed  expression  is  about  as  complex  as  applying  a  procedure.  The  big  win 
is  in  subsequent  references  to  the  value,  which  incur  no  overhead  resulting  from  the  fact  that 
the  value  was  once  a  delayed  expression. 
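
The three points above can be illustrated with a small sketch. This is not the thesis implementation, which relies on hardware support. It is a Python model, and the names Thunk, LazyVector, store_delayed, and expensive are hypothetical. A software type check stands in for the hardware mechanism, so unlike the real mechanism this model pays a small check on every read.

```python
from typing import Any, Callable, List


class Thunk:
    """A delayed expression: a closure over the environment it needs."""

    __slots__ = ("compute",)

    def __init__(self, compute: Callable[[], Any]) -> None:
        self.compute = compute


class LazyVector:
    """Vector whose slots may hold either a value or a delayed expression."""

    def __init__(self, size: int) -> None:
        self._slots: List[Any] = [None] * size

    def store(self, index: int, value: Any) -> None:
        """Eager store: the value is already computed."""
        self._slots[index] = value

    def store_delayed(self, index: int, compute: Callable[[], Any]) -> None:
        """Delayed store: creating the thunk only saves the closure."""
        self._slots[index] = Thunk(compute)

    def __getitem__(self, index: int) -> Any:
        """Read a slot, forcing it implicitly on first access."""
        slot = self._slots[index]
        if isinstance(slot, Thunk):
            slot = slot.compute()      # evaluated at most once
            self._slots[index] = slot  # memoized in the slot itself
        return slot


evaluations = []  # records how often the delayed expression runs


def expensive() -> int:
    evaluations.append("ran")
    return 6 * 7


v = LazyVector(2)
v.store(0, 10)                 # ordinary eager slot
v.store_delayed(1, expensive)  # delayed only where requested
first = v[1]                   # forces the thunk
second = v[1]                  # reads the memoized value
```

Only the slot the programmer marked as delayed carries a thunk, and the second read finds a plain value in the slot.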

5.3.2  Embedding  Henderson’s  System 

As  we  mentioned  in  the  introductory  chapter,  the  framework  presented  for  lazy  structures  can 
embed  a  Henderson-style  source-to-source  transformation  for  achieving  lazy  behavior.  Consider 
the  following  transformations.  Delay  and  Force  are  implemented  with  lazy  data-structures. 

delay  <exp>  =>  {vector  (0,0)  [0]  #  <exp>} 

force  <exp>  =>  <exp>[0]

Although  this  approach  is  roughly  as  efficient  as  standard  solutions,  we  do  not  advocate  it. 
Any  Henderson-style  system  is  likely  to  end  up  with  too  many  thunks  and  be  too  expensive. 
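
The embedding can be sketched in Python as follows. This is an illustrative model only: LazyVector, set_delayed, delay, and force are hypothetical names, and the delayed expression is passed as a zero-argument function because Python has no source-to-source transformation step.

```python
from typing import Any, Callable, List, Optional


class LazyVector:
    """Vector over inclusive bounds (lo, hi) with delayed, memoized slots."""

    def __init__(self, lo: int, hi: int) -> None:
        size = hi - lo + 1
        self._lo = lo
        self._pending: List[Optional[Callable[[], Any]]] = [None] * size
        self._values: List[Any] = [None] * size

    def set_delayed(self, index: int, compute: Callable[[], Any]) -> None:
        """Record a delayed expression for the slot without evaluating it."""
        self._pending[index - self._lo] = compute

    def __getitem__(self, index: int) -> Any:
        """Reading a slot forces it implicitly, then memoizes the result."""
        k = index - self._lo
        compute = self._pending[k]
        if compute is not None:
            self._values[k] = compute()
            self._pending[k] = None
        return self._values[k]


def delay(compute: Callable[[], Any]) -> LazyVector:
    """Models: delay <exp>  =>  {vector (0,0) [0] # <exp>}"""
    cell = LazyVector(0, 0)
    cell.set_delayed(0, compute)
    return cell


def force(delayed: LazyVector) -> Any:
    """Models: force <exp>  =>  <exp>[0]"""
    return delayed[0]


runs = []  # records how often the delayed expression is evaluated


def body() -> int:
    runs.append(1)
    return 1 + 2


promise = delay(body)
before = len(runs)      # nothing has been evaluated yet
a = force(promise)
b = force(promise)      # second force reuses the stored value
```

Each delay allocates a one-slot vector, which makes visible why a program transformed wholesale in this style accumulates many thunks.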

5.3.3  Variations  and  Future  Directions 

We have developed a technique which depends on explicit delaying and implicit forcing. The
implicit part relies on a hardware mechanism for synchronization and, not surprisingly, is very
efficient. If a similar mechanism were available to trap delayed values so that normal processing
were uninterrupted, we might investigate delaying expressions in general. Given hardware
support, trapping such delayed expressions can be done in many scenarios if we are willing to
give up sharing. We might view such a system as similar to Lisp’s invisible forwarding pointer system.

Several related approaches are worth considering. All of the following possibilities are
variations of the language or the semantics but preserve the underlying implementation.

Where data structures are involved, lazy structures can be thought of as simply
an optimization in any lazy language on a tagged architecture. Rather than have a thunk
resident in a data structure slot, use structure tag-bits to implement L-structures, cutting out
the intermediate storage. Implicit forcing comes as a bonus.

Another  possibility  is  eager  evaluation  for  all  expressions  except  those  destined  for  data 
structure  slots.  In  order  to  guarantee  termination,  it  may  suffice  to  guarantee  that  all  circular 
expressions  are  cut  by  data-structures,  in  much  the  same  way  that  circular  combinational  logic 
must  be  cut  by  storage  elements.  Either  strictness  analysis  or  annotations  may  be  useful  for 
generating optimized programs. This approach is similar to Burton's approach [15].
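
A minimal Python sketch of a circular definition cut by a data-structure slot is given below. It is an illustration under assumptions, not code from the thesis: Cons, ones, integers_from, and take are hypothetical names, the head slot is filled eagerly, and only the tail slot is delayed.

```python
from typing import Any, Callable, List, Optional


class Cons:
    """Pair whose head is stored eagerly and whose tail slot is delayed."""

    def __init__(self, head: Any, tail_thunk: Callable[[], "Cons"]) -> None:
        self.head = head
        self._tail_thunk: Optional[Callable[[], "Cons"]] = tail_thunk
        self._tail: Optional["Cons"] = None

    @property
    def tail(self) -> "Cons":
        """Force the tail on first read and memoize it in the cell."""
        if self._tail_thunk is not None:
            self._tail = self._tail_thunk()
            self._tail_thunk = None
        return self._tail


def integers_from(n: int) -> Cons:
    """Recursive definition; the delayed tail slot cuts the recursion."""
    return Cons(n, lambda: integers_from(n + 1))


def take(stream: Cons, count: int) -> List[Any]:
    """Collect the first `count` heads of a stream."""
    out = []
    for _ in range(count):
        out.append(stream.head)
        stream = stream.tail
    return out


# A directly circular definition: `ones` refers to itself, but only
# through the delayed tail slot, so constructing it terminates.
ones = Cons(1, lambda: ones)

prefix = take(integers_from(0), 5)
unit_prefix = take(ones, 3)
```

Everything outside the tail slot is evaluated eagerly. If the tail were also computed eagerly, neither definition would terminate.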

Similarly,  we  can  assign  even  structure  slots  eagerly,  except  those  that  contain  pointers  to 
other  structures.  We  could  benefit  from  compiler  analysis  (type  checking)  and  annotations, 
both  eager  and  lazy. 

The repercussions of these varied choices are not clear. An interesting approach involves
using the techniques of Lucassen [31] and Young [48]. Lucassen developed techniques for
categorizing expressions as pure and side-effecting. Due to the presence of the I-structure language
construct in Id, changing an expression’s evaluation semantics to eager or lazy may change its
meaning. Analysis similar to Lucassen’s may avoid these difficulties. Young developed
techniques for approximating the cost of expressions and approximating termination behavior. If
the cost of evaluating an expression is lower than the cost of building a thunk, then eager
evaluation is more efficient, assuming, of course, that the meaning of the expression is the same.
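
The decision rule in the last sentence can be written down as a small sketch. Everything here is hypothetical: ExprInfo, its fields, the cost units, and the constant THUNK_BUILD_COST are assumptions for illustration, and the analyses that would supply these values are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExprInfo:
    """Assumed results of compile-time analyses for one expression."""

    estimated_cost: int      # approximate evaluation cost, arbitrary units
    is_pure: bool            # no side effects, so evaluation order is free
    surely_terminates: bool  # termination approximation says "yes"


THUNK_BUILD_COST = 5  # assumed cost of building a thunk, same units


def evaluate_eagerly(info: ExprInfo, thunk_cost: int = THUNK_BUILD_COST) -> bool:
    """Choose eager evaluation only when it is both safe and cheaper.

    Safe: switching to eager evaluation does not change the meaning.
    Cheaper: evaluating costs less than building the thunk.
    """
    preserves_meaning = info.is_pure and info.surely_terminates
    return preserves_meaning and info.estimated_cost < thunk_cost


cheap_pure = ExprInfo(estimated_cost=2, is_pure=True, surely_terminates=True)
costly_pure = ExprInfo(estimated_cost=50, is_pure=True, surely_terminates=True)
cheap_effect = ExprInfo(estimated_cost=2, is_pure=False, surely_terminates=True)
```

Under these assumptions only cheap_pure would be evaluated eagerly. The other two keep the programmer's delay, one on cost grounds and one because its meaning might change.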

5.3.4  Concluding  Remarks 

Our thesis can be stated concisely as follows. An eager non-strict language plus lazy data
structures with otherwise eager semantics provides most of the expressive power of a lazy functional
language and an opportunity for implementation at close to the cost of an implementation of
an eager non-strict language.

The language Id#, Id plus lazy data-structures, was presented in Chapter 2, and its
expressive power as well as its limitations were demonstrated in Chapter 4. Chapter 3 presented an
efficient implementation.

This much is clear: most expressions can be evaluated eagerly with no loss of expressive
power (chance of non-termination). An eager functional language is already very close to the
target. A lazy functional language, on the other hand, is far from the target, and, if we can
get there at all, it is sure to be a struggle. Great progress has been made in strictness analysis,
recently extending to non-flat domains [20, 46] (i.e., data structures). But it is unlikely that we
will reach the target. We have already pointed out the problem in analyzing indexed structures.
Furthermore, the problem of handling higher-order functions and data structures simultaneously
appears quite difficult.

To the purist, who is unwilling to accept any annotation, we simply note that, as
demonstrated, these annotations fit naturally into the source language and have clear semantics.
Furthermore, the repercussions are clear and confined.

For  the  eager  non-strict  programmer,  who  is  willing  to  take  responsibility  for  her  program’s 
actions,  we  hope  to  have  opened  up  a  gamut  of  possibilities,  both  powerful  and  efficient. 